Preface to the Third Edition


The Third Edition of Testing Statistical Hypotheses brings it into consonance 
with the Second Edition of its companion volume on point estimation (Lehmann 
and Casella, 1998) to which we shall refer as TPE2. We won’t here comment on 
the long history of the book which is recounted in Lehmann (1997) but shall use 
this Preface to indicate the principal changes from the 2nd Edition. 

The present volume is divided into two parts. Part I (Chapters 1-10) treats
small-sample theory, while Part II (Chapters 11-15) treats large-sample theory.
The preface to the 2nd Edition stated that “the most important omission is an
adequate treatment of optimality paralleling that given for estimation in TPE.”
We shall here remedy this failure by treating the difficult topic of asymptotic 
optimality (in Chapter 13) together with the large-sample tools needed for this 
purpose (in Chapters 11 and 12). Having developed these tools, we use them in 
Chapter 14 to give a much fuller treatment of tests of goodness of fit than was 
possible in the 2nd Edition, and in Chapter 15 to provide an introduction to 
the bootstrap and related techniques. Various large-sample considerations that 
in the Second Edition were discussed in earlier chapters now have been moved to 
Chapter 11. 

Another major addition is a more comprehensive treatment of multiple testing 
including some recent optimality results. This topic is now presented in Chapter 
9. In order to make room for these extensive additions, we had to eliminate some 
material found in the Second Edition, primarily the coverage of the multivariate 
linear hypothesis. 

Except for some of the basic results from Part I, a detailed knowledge of small- 
sample theory is not required for Part II. In particular, the necessary background 
should include: Chapter 3, Sections 3.1-3.5, 3.8-3.9; Chapter 4, Sections 4.1-4.4;
Chapter 5, Sections 5.1-5.3; Chapter 6, Sections 6.1-6.2; Chapter 7, Sections
7.1-7.2; Chapter 8, Sections 8.1-8.2, 8.4-8.5.





Of the two principal additions to the Third Edition, multiple comparisons 
and asymptotic optimality, each has a godfather. The development of multiple 
comparisons owes much to the 1953 volume on the subject by John Tukey, a
mimeographed version of which was widely distributed at the time. It was officially
published only in 1994 as Volume VIII in The Collected Works of John W. Tukey. 

Many of the basic ideas on asymptotic optimality are due to the work of Le 
Cam between 1955 and 1980. It culminated in his 1986 book, Asymptotic Methods 
in Statistical Decision Theory. 

The work of these two authors, both of whom died in 2000, spans the achievements
of statistics in the second half of the 20th century, from model-free
data analysis to the most abstract and mathematical asymptotic theory. In
acknowledgment of their great accomplishments, this volume is dedicated to their
memory.

Special thanks to Noureddine El Karoui, Matt Finkelman, Brit Katzen, Mee 
Young Park, Elizabeth Purdom, Armin Schwartzman, Azeem Shaikh and the 
many students at Stanford University who proofread several versions of the new 
chapters and worked through many of the over 300 new problems. The support
and suggestions of our colleagues are greatly appreciated, especially Persi Diaconis,
Brad Efron, Susan Holmes, Balasubramanian Narasimhan, Dimitris Politis,
Julie Shaffer, Guenther Walther and Michael Wolf. Finally, heartfelt thanks go to
friends and family who provided continual encouragement, especially Ann Marie 
and Mark Hodges, David Fogle, Scott Madover, David Olachea, Janis and Jon 
Squire, Lucy, and Ron Susek. 

E. L. Lehmann 
Joseph P. Romano 


January, 2005 
1.1 Statistical Inference and Statistical Decisions

The raw material of a statistical investigation is a set of observations; these are
the values taken on by random variables $X$ whose distribution $P_\theta$ is at least
partly unknown. Of the parameter $\theta$, which labels the distribution, it is assumed
known only that it lies in a certain set $\Omega$, the parameter space. Statistical inference
is concerned with methods of using this observational material to obtain
information concerning the distribution of $X$ or the parameter $\theta$ with which it is
labeled. To arrive at a more precise formulation of the problem we shall consider
the purpose of the inference.

The need for statistical analysis stems from the fact that the distribution of X, 
and hence some aspect of the situation underlying the mathematical model, is not 
known. The consequence of such a lack of knowledge is uncertainty as to the best 
mode of behavior. To formalize this, suppose that a choice has to be made between 
a number of alternative actions. The observations, by providing information about 
the distribution from which they came, also provide guidance as to the best 
decision. The problem is to determine a rule which, for each set of values of the 
observations, specifies what decision should be taken. Mathematically such a rule 
is a function $\delta$, which to each possible value $x$ of the random variables assigns a
decision $d = \delta(x)$, that is, a function whose domain is the set of values of $X$ and
whose range is the set of possible decisions. 

In order to see how $\delta$ should be chosen, one must compare the consequences of
using different rules. To this end suppose that the consequence of taking decision $d$
when the distribution of $X$ is $P_\theta$ is a loss, which can be expressed as a nonnegative
real number $L(\theta, d)$. Then the long-term average loss that would result from
the use of $\delta$ in a number of repetitions of the experiment is the expectation
$E[L(\theta, \delta(X))]$ evaluated under the assumption that $P_\theta$ is the true distribution of
$X$. This expectation, which depends on the decision rule $\delta$ and the distribution
$P_\theta$, is called the risk function of $\delta$ and will be denoted by $R(\theta, \delta)$. By basing the
decision on the observations, the original problem of choosing a decision $d$ with
loss function $L(\theta, d)$ is thus replaced by that of choosing $\delta$, where the loss is now
$R(\theta, \delta)$.
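
The definition of the risk function can be made concrete with a small computation. The following Python sketch is only an illustration under assumptions that are not part of the text: the observation is taken to be binomial with $\theta = p$, the rule is the sample proportion $\delta(x) = x/n$, the loss is squared error, and all function names are hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X distributed as b(p, n)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def squared_error_loss(theta, d):
    """Illustrative loss L(theta, d) = (d - theta)^2 (an assumption)."""
    return (d - theta) ** 2

def risk(theta, delta, n, loss=squared_error_loss):
    """R(theta, delta) = E_theta[L(theta, delta(X))], summed over x = 0, ..., n."""
    return sum(loss(theta, delta(x)) * binomial_pmf(x, n, theta)
               for x in range(n + 1))

n = 10
sample_proportion = lambda x: x / n   # hypothetical decision rule delta(x) = x/n

for p in (0.1, 0.5, 0.9):
    print(p, risk(p, sample_proportion, n))
```

For this particular rule and loss the sum reduces to $p(1-p)/n$, so the risk depends on the unknown parameter, as the discussion above anticipates.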

The above discussion suggests that the aim of statistics is the selection of 
a decision function which minimizes the resulting risk. As will be seen later, 
this statement of aims is not sufficiently precise to be meaningful; its proper 
interpretation is in fact one of the basic problems of the theory. 


1.2 Specification of a Decision Problem 

The methods required for the solution of a specific statistical problem depend 
quite strongly on the three elements that define it: the class $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ to
which the distribution of $X$ is assumed to belong; the structure of the space $D$
of possible decisions $d$; and the form of the loss function $L$. In order to obtain
concrete results it is therefore necessary to make specific assumptions about these 
elements. On the other hand, if the theory is to be more than a collection of 
isolated results, the assumptions must be broad enough either to be of wide 
applicability or to define classes of problems for which a unified treatment is 
possible. 

Consider first the specification of the class $\mathcal{P}$. Precise numerical assumptions
concerning probabilities or probability distributions are usually not warranted.
However, it is frequently possible to assume that certain events have equal probabilities
and that certain others are statistically independent. Another type of
assumption concerns the relative order of certain infinitesimal probabilities, for
example the probability of occurrences in an interval of time or space as the
length of the interval tends to zero. The following classes of distributions are
derived on the basis of only such assumptions, and are therefore applicable in a
great variety of situations.

The binomial distribution $b(p, n)$ with

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, \ldots, n, \quad 0 \le p \le 1. \tag{1.1}$$

This is the distribution of the total number of successes in n independent trials 
when the probability of success for each trial is p. 

The Poisson distribution $P(\tau)$ with

$$P(X = x) = \frac{\tau^x}{x!} e^{-\tau}, \quad x = 0, 1, \ldots, \quad 0 < \tau. \tag{1.2}$$

This is the distribution of the number of events occurring in a fixed interval of 
time or space if the probability of more than one occurrence in a very short 
interval is of smaller order of magnitude than that of a single occurrence, and if 
the numbers of events in nonoverlapping intervals are statistically independent. 
Under these assumptions, the process generating the events is called a Poisson 
process. Such processes are discussed, for example, in the books by Feller (1968),
Ross (1996), and Taylor and Karlin (1998). 

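The normal distribution $N(\xi, \sigma^2)$ with probability density

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(x - \xi)^2\right], \quad -\infty < x, \xi < \infty, \quad 0 < \sigma. \tag{1.3}$$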


Under very general conditions, which are made precise by the central limit theorem,
this is the approximate distribution of the sum of a large number of
independent random variables when the relative contribution of each term to 
the sum is small. 
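
As a rough numerical companion to the three families, the sketch below evaluates them in Python using only the standard library. It assumes the usual forms of (1.1)-(1.3), namely the binomial and Poisson probability functions and the normal density with mean $\xi$ and standard deviation $\sigma$; the function names and the parameter values in the checks are hypothetical.

```python
from math import comb, exp, factorial, pi, sqrt

def binomial_pmf(x, n, p):
    """(1.1): probability of x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, tau):
    """(1.2): probability of x events when the mean count is tau > 0."""
    return tau**x / factorial(x) * exp(-tau)

def normal_pdf(x, xi, sigma):
    """(1.3): density of N(xi, sigma^2) at x."""
    return exp(-(x - xi) ** 2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

# Sanity checks: the probabilities add up to 1 (the Poisson sum is truncated).
binom_total = sum(binomial_pmf(x, 8, 0.3) for x in range(9))
poisson_total = sum(poisson_pmf(x, 2.5) for x in range(60))
print(binom_total, poisson_total, normal_pdf(0.0, 0.0, 1.0))
```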

We consider next the structure of the decision space D. The great variety of 
possibilities is indicated by the following examples. 


Example 1.2.1 Let $X_1, \ldots, X_n$ be a sample from one of the distributions (1.1)-(1.3),
that is let the $X$’s be distributed independently and identically according
to one of these distributions. Let $\theta$ be $p$, $\tau$, or the pair $(\xi, \sigma)$ respectively, and let
$\gamma = \gamma(\theta)$ be a real-valued function of $\theta$.

(i) If one wishes to decide whether or not $\gamma$ exceeds some specified value $\gamma_0$,
the choice lies between the two decisions $d_0 : \gamma > \gamma_0$ and $d_1 : \gamma \le \gamma_0$. In specific
applications these decisions might correspond to the acceptance or rejection of a
lot of manufactured goods, of an experimental airplane as ready for flight testing,
of a new treatment as an improvement over a standard one, and so on. The loss
function of course depends on the application to be made. Typically, the loss is 0
if the correct decision is chosen, while for an incorrect decision the losses $L(\gamma, d_0)$
and $L(\gamma, d_1)$ are increasing functions of $|\gamma - \gamma_0|$.

(ii) At the other end of the scale is the much more detailed problem of obtaining
a numerical estimate of $\gamma$. Here a decision $d$ of the statistician is a real
number, the estimate of $\gamma$, and the losses might be $L(\gamma, d) = v(\gamma) w(|d - \gamma|)$,
where $w$ is a strictly increasing function of the error $|d - \gamma|$.

(iii) An intermediate case is the choice between the three alternatives $d_0 :
\gamma < \gamma_0$, $d_1 : \gamma > \gamma_1$, $d_2 : \gamma_0 \le \gamma \le \gamma_1$, for example accepting a new treatment,
rejecting it, or recommending it for further study. ■

The distinction illustrated by this example is the basis for one of the principal
classifications of statistical methods. Two-decision problems such as (i) are
usually formulated in terms of testing a hypothesis which is to be accepted or 
rejected (see Chapter 3). It is the theory of this class of problems with which we 
shall be mainly concerned here. The other principal branch of statistics is the 
theory of point estimation dealing with problems such as (ii). This is the subject 
of TPE2. The intermediate problem (iii) is a special case of a multiple decision 
procedure. Some problems of this kind are treated in Ferguson (1967, Chapter 6);
a discussion of some others is given in Chapter 9. 


Example 1.2.2 Suppose that the data consist of samples $X_{ij}$, $j = 1, \ldots, n_i$,
from normal populations $N(\xi_i, \sigma^2)$, $i = 1, \ldots, s$.

(i) Consider first the case $s = 2$ and the question of whether or not there is
a material difference between the two populations. This has the same structure
as problem (iii) of the previous example. Here the choice lies between the three
decisions $d_0 : |\xi_2 - \xi_1| \le \Delta$, $d_1 : \xi_2 > \xi_1 + \Delta$, $d_2 : \xi_2 < \xi_1 - \Delta$, where $\Delta$ is
preassigned. An analogous problem, involving $k + 1$ possible decisions, occurs
in the general case of $k$ populations. In this case one must choose between the
decision that the $k$ distributions do not differ materially, $d_0 : \max|\xi_j - \xi_i| \le \Delta$,
and the decisions $d_k : \max|\xi_j - \xi_i| > \Delta$ and $\xi_k$ is the largest of the means.

(ii) A related problem is that of ranking the distributions in increasing order
of their mean $\xi$.

(iii) Alternatively, a standard $\xi_0$ may be given and the problem is to decide
which, if any, of the population means exceed the standard. ■

Example 1.2.3 Consider two distributions, to be specific two Poisson distributions
$P(\tau_0)$, $P(\tau_1)$, and suppose that $\tau_0$ is known to be less than $\tau_1$ but that
otherwise the $\tau$’s are unknown. Let $Z_1, \ldots, Z_n$ be independently distributed, each
according to either $P(\tau_0)$ or $P(\tau_1)$. Then each $Z$ is to be classified as to which
of the two distributions it comes from. Here the loss might be the number of $Z$’s
that are incorrectly classified, multiplied by a suitable function of $\tau_0$ and $\tau_1$. An
example of the complexity that such problems can attain and the conceptual as
well as mathematical difficulties that they may involve is provided by the efforts
of anthropologists to classify the human population into a number of homogeneous
races by studying the frequencies of the various blood groups and of other
genetic characters. ■

All the problems considered so far could be termed action problems. It was 
assumed in all of them that if $\theta$ were known a unique correct decision would
be available, that is, given any $\theta$, there exists a unique $d$ for which $L(\theta, d) = 0$.
However, not all statistical problems are so clear-cut. Frequently it is a question 
of providing a convenient summary of the data or indicating what information 
is available concerning the unknown parameter or distribution. This information 
will be used for guidance in various considerations but will not provide the sole 
basis for any specific decisions. In such cases the emphasis is on the inference 
rather than on the decision aspect of the problem. Although formally it can still 
be considered a decision problem if the inferential statement itself is interpreted as 
the decision to be taken, the distinction is of conceptual and practical significance 
despite the fact that frequently it is ignored.¹ An important class of such problems,
estimation by interval, is illustrated by the following example. (For the more usual 
formulation in terms of confidence intervals, see Sections 3.5, 5.4 and 5.5.) 

Example 1.2.4 Let $X = (X_1, \ldots, X_n)$ be a sample from $N(\xi, \sigma^2)$ and let a decision
consist in selecting an interval $[\underline{L}, \bar{L}]$ and stating that it contains $\xi$. Suppose
that decision procedures are restricted to intervals $[\underline{L}(X), \bar{L}(X)]$ whose expected
length for all $\xi$ and $\sigma$ does not exceed $k\sigma$ where $k$ is some preassigned constant.
An appropriate loss function would be 0 if the decision is correct and would otherwise
depend on the relative position of the interval to the true value of $\xi$. In
this case there are many correct decisions corresponding to a given distribution
$N(\xi, \sigma^2)$. ■


¹ For a more detailed discussion of this distinction see, for example, Cox (1958), Blyth
(1970), and Barnett (1999).
It remains to discuss the choice of loss function, and of the three elements
defining the problem this is perhaps the most difficult to specify. Even in the
simplest case, where all losses eventually reduce to financial ones, it can hardly
be expected that one will be able to evaluate all the short- and long-term consequences
of an action. Frequently it is possible to simplify the formulation by
taking into account only certain aspects of the loss function. As an illustration
consider Example 1.2.1(i) and let $L(\theta, d_0) = a$ for $\gamma(\theta) \le \gamma_0$ and $L(\theta, d_1) = b$ for
$\gamma(\theta) > \gamma_0$. The risk function becomes


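$$R(\theta, \delta) = \begin{cases} a P_\theta\{\delta(X) = d_0\} & \text{if } \gamma \le \gamma_0, \\ b P_\theta\{\delta(X) = d_1\} & \text{if } \gamma > \gamma_0, \end{cases} \tag{1.4}$$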


and is seen to involve only the two probabilities of error, with weights which 
can be adjusted according to the relative importance of these errors. Similarly,
in Example 1.2.3 one may wish to restrict attention to the number of
misclassifications. 
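
A risk function of the form (1.4) can be evaluated directly once a rule is fixed. The sketch below is a hypothetical instance and not taken from the text: $X$ is assumed binomial with $\gamma(\theta) = p$ and $\gamma_0 = 1/2$, and the rule takes $d_0$ (asserting $p > \gamma_0$) when $X$ is at least some cutoff. The cutoff, the weights $a$ and $b$, and the function names are all illustrative choices.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X distributed as b(p, n)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def risk_two_decision(p, n, cutoff, gamma0=0.5, a=1.0, b=1.0):
    """Risk of the form (1.4) for the rule: d0 if X >= cutoff, else d1.

    a is the loss for wrongly taking d0, b the loss for wrongly taking d1.
    """
    prob_d0 = sum(binomial_pmf(x, n, p) for x in range(cutoff, n + 1))
    if p <= gamma0:
        return a * prob_d0          # error: d0 taken although p <= gamma0
    return b * (1.0 - prob_d0)      # error: d1 taken although p > gamma0

n, cutoff = 20, 14
for p in (0.3, 0.5, 0.6, 0.8):
    print(p, round(risk_two_decision(p, n, cutoff, a=2.0, b=1.0), 4))
```

Changing the ratio of `a` to `b` rescales the two branches of the risk separately, which is the sense in which the weights express the relative importance of the two kinds of error.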

Unfortunately, such a natural simplification is not always available, and in the 
absence of specific knowledge it becomes necessary to select the loss function 
in some conventional way, with mathematical simplicity usually an important 
consideration. In point estimation problems such as that considered in Example
1.2.1(ii), if one is interested in estimating a real-valued function $\gamma = \gamma(\theta)$, it is
customary to take the square of the error, or somewhat more generally to put

$$L(\theta, d) = v(\theta)(d - \gamma)^2. \tag{1.5}$$


Besides being particularly simple mathematically, this can be considered as an
approximation to the true loss function $L$ provided that for each fixed $\theta$, $L(\theta, d)$
is twice differentiable in $d$, that $L(\theta, \gamma(\theta)) = 0$ for all $\theta$, and that the error is not
large.

It is frequently found that, within one problem, quite different types of losses 
may occur, which are difficult to measure on a common scale. Consider once 
more Example 1.2.1(i) and suppose that $\gamma_0$ is the value of $\gamma$ when a standard
treatment is applied to a situation in medicine, agriculture, or industry. The
problem is that of comparing some new process with unknown $\gamma$ to the standard
one. Turning down the new method when it is actually superior, or adopting it 
when it is not, clearly entails quite different consequences. In such cases it is 
sometimes convenient to treat the various loss components, say $L_1, L_2, \ldots, L_r$,
separately. Suppose in particular that $r = 2$ and that $L_1$ represents the more
serious possibility. One can then assign a bound to this risk component, that is,
impose the condition

$$E L_1(\theta, \delta(X)) \le \alpha, \tag{1.6}$$

and subject to this condition minimize the other component of the risk. Example
1.2.4 provides an illustration of this procedure. The length of the interval $[\underline{L}, \bar{L}]$
(measured in $\sigma$-units) is one component of the loss function, the other being the
loss that results if the interval does not cover the true $\xi$.
1.3 Randomization; Choice of Experiment

The description of the general decision problem given so far is still too narrow in 
certain respects. It has been assumed that for each possible value of the random 
variables a definite decision must be chosen. Instead, it is convenient to permit the 
selection of one out of a number of decisions according to stated probabilities, or 
more generally the selection of a decision according to a probability distribution 
defined over the decision space; which distribution depends of course on what 
x is observed. One way to describe such a randomized procedure is in terms of 
a nonrandomized procedure depending on X and a random variable Y whose 
values lie in the decision space and whose conditional distribution given x is 
independent of $\theta$.

Although it may run counter to one’s intuition that such extra randomization
should have any value, there is no harm in permitting this greater freedom
of choice. If the intuitive misgivings are correct, it will turn out that the optimum
procedures always are of the simple nonrandomized kind. Actually, the
introduction of randomized procedures leads to an important mathematical simplification
by enlarging the class of risk functions so that it becomes convex. In
addition, there are problems in which some features of the risk function such as 
its maximum can be improved by using a randomized procedure. 
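
That randomization can lower the maximum risk is easy to check in a toy problem. The following Python sketch is built entirely on assumptions made for illustration: a single success/failure trial, a two-point parameter space $p \in \{0.3, 0.6\}$, and a loss of 1 for a wrong decision and 0 otherwise. It compares the best nonrandomized rule with a rule that, with probability $q$, ignores the observation and decides for the larger value.

```python
from itertools import product

thetas = (0.3, 0.6)            # hypothetical two-point parameter space for p
decisions = ("low", "high")    # "low" asserts p = 0.3, "high" asserts p = 0.6

def risk_vector(rule):
    """0-1 loss risk of a nonrandomized rule; rule[x] is the decision for x in {0, 1}."""
    risks = []
    for p, correct in zip(thetas, decisions):
        prob = {0: 1 - p, 1: p}
        risks.append(sum(prob[x] for x in (0, 1) if rule[x] != correct))
    return tuple(risks)

# All four nonrandomized rules and the smallest maximum risk among them.
rules = [dict(zip((0, 1), ds)) for ds in product(decisions, repeat=2)]
best_pure = min(max(risk_vector(r)) for r in rules)

def mixed_max_risk(q):
    """With probability q always decide "high"; otherwise decide "high" iff x = 1."""
    r_const = risk_vector({0: "high", 1: "high"})
    r_data = risk_vector({0: "low", 1: "high"})
    return max(q * c + (1 - q) * d for c, d in zip(r_const, r_data))

best_mixed = min(mixed_max_risk(k / 1100) for k in range(1101))
print(best_pure, best_mixed)
```

The risk of the randomized rule is the corresponding convex combination of the two risk vectors, which illustrates why the class of risk functions becomes convex once randomization is allowed.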

Another assumption that tacitly has been made so far is that a definite experiment
has already been decided upon so that it is known what observations will be
taken. However, the statistical considerations involved in designing an experiment
are no less important than those concerning its analysis. One question in particular
that must be decided before an investigation is undertaken is how many
observations should be taken so that the risk resulting from wrong decisions will 
not be excessive. Frequently it turns out that the required sample size depends 
on the unknown distribution and therefore cannot be determined in advance as 
a fixed number. Instead it is then specified as a function of the observations and 
the decision whether or not to continue experimentation is made sequentially at 
each stage of the experiment on the basis of the observations taken up to that 
point. 

Example 1.3.1 On the basis of a sample $X_1, \ldots, X_n$ from a normal distribution
$N(\xi, \sigma^2)$ one wishes to estimate $\xi$. Here the risk function of an estimate, for
example its expected squared error, depends on $\sigma$. For large $\sigma$ the sample contains
only little information in the sense that two distributions $N(\xi_1, \sigma^2)$ and $N(\xi_2, \sigma^2)$
with fixed difference $\xi_2 - \xi_1$ become indistinguishable as $\sigma \to \infty$, with the result
that the risk tends to infinity. Conversely, the risk approaches zero as $\sigma \to 0$,
since then effectively the mean becomes known. Thus the number of observations
needed to control the risk at a given level is unknown. However, as soon as some
observations have been taken, it is possible to estimate $\sigma^2$ and hence to determine
the additional number of observations required. ■
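
The idea of the example can be sketched as a two-stage computation. The code below is a simple plug-in illustration under assumptions of our own: the estimate is the sample mean, whose expected squared error is $\sigma^2/n$, the unknown $\sigma^2$ is replaced by the variance of a pilot sample, and the target risk and function name are hypothetical. A rigorous procedure would also account for the variability of the variance estimate, which this sketch ignores.

```python
import math
import random
import statistics

def additional_observations(pilot, target_risk):
    """Plug-in sketch: extra observations so that s^2 / n_total <= target_risk.

    The risk of the sample mean under squared error is sigma^2 / n; here the
    unknown sigma^2 is replaced by the pilot-sample variance s^2.
    """
    s2 = statistics.variance(pilot)              # unbiased estimate of sigma^2
    n_total = max(len(pilot), math.ceil(s2 / target_risk))
    return n_total - len(pilot)

random.seed(0)
for sigma in (0.5, 2.0, 8.0):                    # hypothetical true values, unknown in practice
    pilot = [random.gauss(10.0, sigma) for _ in range(30)]
    print(sigma, additional_observations(pilot, target_risk=0.04))
```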

Example 1.3.2 In a sequence of trials with constant probability $p$ of success,
one wishes to decide whether $p \le \frac{1}{2}$ or $p > \frac{1}{2}$. It will usually be possible to reach a
decision at an early stage if $p$ is close to 0 or 1 so that practically all observations
are of one kind, while a larger sample will be needed for intermediate values of
$p$. This difference may be partially balanced by the fact that for intermediate
values a loss resulting from a wrong decision is presumably less serious than for
the more extreme values. ■

Example 1.3.3 The possibility of determining the sample size sequentially is
important not only because the distributions $P_\theta$ can be more or less informative
but also because the same is true of the observations themselves. Consider, for
example, observations from the uniform distribution over the interval
$(\theta - \frac{1}{2}, \theta + \frac{1}{2})$ and the problem of estimating $\theta$. Here there is no difference in the amount
of information provided by the different distributions $P_\theta$. However, a sample
$X_1, X_2, \ldots, X_n$ can practically pinpoint $\theta$ if $\max|X_j - X_i|$ is sufficiently close
to 1, or it can give essentially no more information than a single observation if
$\max|X_j - X_i|$ is close to 0. Again the required sample size should be determined
sequentially. ■
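
The point of the example can be checked with a few lines of Python. Since every observation lies within $\frac{1}{2}$ of $\theta$, the values of $\theta$ compatible with a sample form the interval from the largest observation minus $\frac{1}{2}$ to the smallest plus $\frac{1}{2}$, whose width is 1 minus the sample range. The data and the function name below are hypothetical.

```python
def feasible_interval(sample):
    """Values of theta compatible with a sample from U(theta - 1/2, theta + 1/2)."""
    lower = max(sample) - 0.5      # every x is below theta + 1/2
    upper = min(sample) + 0.5      # every x is above theta - 1/2
    return lower, upper

informative = [3.02, 3.97, 3.40]      # hypothetical data, range 0.95
uninformative = [3.48, 3.50, 3.52]    # hypothetical data, range 0.04

for xs in (informative, uninformative):
    lo, hi = feasible_interval(xs)
    print(xs, (round(lo, 2), round(hi, 2)), "width", round(hi - lo, 2))
```

Both samples have three observations, yet the first confines $\theta$ to an interval of width about 0.05 and the second only to one of width about 0.96.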

Except in the simplest situations, the determination of the appropriate sample 
size is only one aspect of the design problem. In general, one must decide not 
only how many but also what kind of observations to take. In clinical trials, for 
example, when a new treatment is being compared with a standard procedure, 
a protocol is required which specifies to which of the two treatments each of the 
successive incoming patients is to be assigned. Formally, such questions can be 
subsumed under the general decision problem described at the beginning of the 
chapter, by interpreting X as the set of all available variables, by introducing 
the decisions whether or not to stop experimentation at the various stages, by 
specifying in case of continuance which type of variable to observe next, and by 
including the cost of observation in the loss function. 

The determination of optimum sequential stopping rules and experimental 
designs is outside the scope of this book. An introduction to this subject is 
provided, for example, by Siegmund (1985). 


1.4 Optimum Procedures 

At the end of Section 1.1 the aim of statistical theory was stated to be the
determination of a decision function $\delta$ which minimizes the risk function

$$R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]. \tag{1.7}$$

Unfortunately, in general the minimizing $\delta$ depends on $\theta$, which is unknown.
Consider, for example, some particular decision $d_0$, and the decision procedure
$\delta(x) \equiv d_0$ according to which decision $d_0$ is taken regardless of the outcome
of the experiment. Suppose that $d_0$ is the correct decision for some $\theta_0$, so that
$L(\theta_0, d_0) = 0$. Then $\delta$ minimizes the risk at $\theta_0$ since $R(\theta_0, \delta) = 0$, but presumably
at the cost of a high risk for other values of $\theta$.
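
A small numerical illustration may help here. The sketch assumes, purely for illustration, a binomial observation with $\theta = p$, squared-error loss, and the constant rule that always decides $p_0 = 0.5$; its risk is compared with that of the sample proportion, whose risk is $p(1-p)/n$. The names and the values of $n$ and $p_0$ are hypothetical.

```python
def risk_constant(p, p0=0.5):
    """Squared-error risk of the rule that always decides p0, whatever is observed."""
    return (p0 - p) ** 2

def risk_sample_proportion(p, n=10):
    """Squared-error risk p(1 - p)/n of X/n when X is distributed as b(p, n)."""
    return p * (1 - p) / n

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(risk_constant(p), 4), round(risk_sample_proportion(p), 4))
```

The constant rule cannot be beaten at $p = 0.5$, where its risk is zero, but away from that value its risk grows well beyond that of the sample proportion.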

In the absence of a decision function that minimizes the risk for all $\theta$, the
mathematical problem is still not defined, since it is not clear what is meant 
by a best procedure. Although it does not seem possible to give a definition of 
optimality that will be appropriate in all situations, the following two methods 
of approach frequently are satisfactory. 

The nonexistence of an optimum decision rule is a consequence of the possibility
that a procedure devotes too much of its attention to a single parameter value
at the cost of neglecting the various other values that might arise. This suggests
the restriction to decision procedures which possess a certain degree of impartiality,
and the possibility that within such a restricted class there may exist a
procedure with uniformly smallest risk. Two conditions of this kind, invariance
and unbiasedness, will be discussed in the next section.

Instead of restricting the class of procedures, one can approach the problem 
somewhat differently. Consider the risk functions corresponding to two different 
decision rules $\delta_1$ and $\delta_2$. If $R(\theta, \delta_1) < R(\theta, \delta_2)$ for all $\theta$, then $\delta_1$ is clearly preferable
to $\delta_2$, since its use will lead to a smaller risk no matter what the true value of
$\theta$ is. However, the situation is not clear when the two risk functions intersect
as in Figure 1.1. What is needed is a principle which in such cases establishes a 
preference of one of the two risk functions over the other, that is, which introduces 
an ordering into the set of all risk functions. A procedure will then be optimum if 
its risk function is best according to this ordering. Some criteria that have been 
suggested for ordering risk functions will be discussed in Section 1.6. 
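
Intersecting risk functions are easy to produce. The following sketch assumes a binomial observation and squared-error loss, and compares two hypothetical estimates of $p$, namely $X/n$ and $(X+1)/(n+2)$; neither the model nor the rules come from the text. The helper `dominates` checks on a grid of parameter values whether one rule is never worse and somewhere better than the other.

```python
from math import comb

def risk(p, delta, n):
    """Exact squared-error risk of delta(X) for X distributed as b(p, n)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) * (delta(x) - p) ** 2
               for x in range(n + 1))

def dominates(delta1, delta2, n, grid):
    """True if delta1 is never worse than delta2 on the grid and better somewhere."""
    r1 = [risk(p, delta1, n) for p in grid]
    r2 = [risk(p, delta2, n) for p in grid]
    return all(a <= b for a, b in zip(r1, r2)) and any(a < b for a, b in zip(r1, r2))

n = 10
grid = [k / 100 for k in range(101)]
proportion = lambda x: x / n              # hypothetical rule 1
shrunk = lambda x: (x + 1) / (n + 2)      # hypothetical rule 2

print(dominates(proportion, shrunk, n, grid), dominates(shrunk, proportion, n, grid))
```

Neither rule dominates: the first has smaller risk near $p = 0$ and $p = 1$, the second near $p = \frac{1}{2}$, so a further principle is needed to choose between them.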



Figure 1.1. 

A weakness of the theory of optimum procedures sketched above is its dependence
on an extraneous restricting or ordering principle, and on knowledge
concerning the loss function and the distributions of the observable random
variables which in applications is frequently unavailable or unreliable. These difficulties,
which may raise doubt concerning the value of an optimum theory resting
on such shaky foundations, are in principle no different from those arising in any 
application of mathematics to reality. Mathematical formulations always involve 
simplification and approximation, so that solutions obtained through their use 
cannot be relied upon without additional checking. In the present case a check 
consists in an overall evaluation of the performance of the procedure that the 
theory produces, and an investigation of its sensitivity to departure from the 
assumptions under which it was derived. 

The optimum theory discussed in this book should therefore not be understood 
to be prescriptive. The fact that a procedure $\delta$ is optimal according to some
optimality criterion does not necessarily mean that it is the right procedure to 
use, or even a satisfactory procedure. It does show how well one can do in this 
particular direction and how much is lost when other aspects have to be taken 
into account. 
The aspect of the formulation that typically has the greatest influence on the
solution of the optimality problem is the family $\mathcal{P}$ to which the distribution
of the observations is assumed to belong. The investigation of the robustness
of a proposed procedure to departures from the specified model is an indispensable
feature of a suitable statistical procedure, and although optimality
(exact or asymptotic) may provide a good starting point, modifications are often
necessary before an acceptable solution is found. It is possible to extend the
decision-theoretic framework to include robustness as well as optimality. Suppose
robustness is desired against some class $\mathcal{P}'$ of distributions which is larger (possibly
much larger) than the given $\mathcal{P}$. Then one may assign a bound $M$ to the risk to
be tolerated over $\mathcal{P}'$. Within the class of procedures satisfying this restriction, one
can then optimize the risk over $\mathcal{P}$ as before. Such an approach has been proposed
and applied to a number of specific problems by Bickel (1984) and Kempthorne
(1988).

Another possible extension concerns the actual choice of the family $\mathcal{P}$, the
model used to represent the actual physical situation. The problem of choosing
a model which provides an adequate description of the situation without being
unnecessarily complex can be treated within the decision-theoretic formulation
of Section 1.1 by adding to the loss function a component representing the complexity
of the proposed model. Such approaches to model selection are discussed
in Stone (1981), de Leeuw (1992) and Rao and Wu (2001). 


1.5 Invariance and Unbiasedness²

A natural definition of impartiality suggests itself in situations which are symmetric
with respect to the various parameter values of interest: The procedure is
then required to act symmetrically with respect to these values. 

Example 1.5.1 Suppose two treatments are to be compared and that each is
applied $n$ times. The resulting observations $X_{11}, \ldots, X_{1n}$ and $X_{21}, \ldots, X_{2n}$ are
samples from $N(\xi_1, \sigma^2)$ and $N(\xi_2, \sigma^2)$ respectively. The three available decisions
are $d_0 : |\xi_2 - \xi_1| \le \Delta$, $d_1 : \xi_2 > \xi_1 + \Delta$, $d_2 : \xi_2 < \xi_1 - \Delta$, and the loss is $w_{ij}$ if
decision $d_j$ is taken when $d_i$ would have been correct. If the treatments are to be
compared solely in terms of the $\xi$’s and no outside considerations are involved,
the losses are symmetric with respect to the two treatments so that $w_{01} = w_{02}$,
$w_{10} = w_{20}$, $w_{12} = w_{21}$. Suppose now that the labeling of the two treatments as
1 and 2 is reversed, and correspondingly also the labeling of the $X$’s, the $\xi$’s,
and the decisions $d_1$ and $d_2$. This changes the meaning of the symbols, but the
formal decision problem, because of its symmetry, remains unaltered. It is then
natural to require the corresponding symmetry from the procedure $\delta$ and ask that
$\delta(x_{11}, \ldots, x_{1n}, x_{21}, \ldots, x_{2n}) = d_0$, $d_1$, or $d_2$ as $\delta(x_{21}, \ldots, x_{2n}, x_{11}, \ldots, x_{1n}) = d_0$,
$d_2$, or $d_1$ respectively. If this condition were not satisfied, the decision as to
which population has the greater mean would depend on the presumably quite
accidental and irrelevant labeling of the samples. Similar remarks apply to a
number of further symmetries that are present in this problem. ■

² The concepts discussed here for general decision theory will be developed in more
specialized form in later chapters. The present section may therefore be omitted at first
reading.

Example 1.5.2 Consider a sample $X_1, \ldots, X_n$ from a distribution with density
$\sigma^{-1} f[(x - \xi)/\sigma]$ and the problem of estimating the location parameter $\xi$, say the
mean of the $X$'s, when the loss is $(d - \xi)^2/\sigma^2$, the square of the error expressed
in $\sigma$-units. Suppose that the observations are originally expressed in feet, and
let $X_i' = aX_i$ with $a = 12$ be the corresponding observations in inches. In the
transformed problem the density is $\sigma'^{-1} f[(x' - \xi')/\sigma']$ with $\xi' = a\xi$, $\sigma' = a\sigma$.
Since $(d' - \xi')^2/\sigma'^2 = (d - \xi)^2/\sigma^2$, the problem is formally unchanged. The
same estimation procedure that is used for the original observations is therefore
appropriate after the transformation and leads to $\delta(aX_1, \ldots, aX_n)$ as an estimate
of $\xi' = a\xi$, the parameter $\xi$ expressed in inches. On reconverting the estimate into
feet one finds that if the result is to be independent of the scale of measurements,
$\delta$ must satisfy the condition of scale invariance

$$\frac{\delta(aX_1, \ldots, aX_n)}{a} = \delta(X_1, \ldots, X_n).$$ ■
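The scale condition of Example 1.5.2 can be checked numerically for a candidate estimator. The following is a minimal sketch; the data values and the helper names `satisfies_scale_condition` and `shifted_mean` are hypothetical and serve only as illustration.

```python
from statistics import mean, median

def satisfies_scale_condition(delta, xs, a, tol=1e-9):
    """Check delta(a*x_1, ..., a*x_n) / a == delta(x_1, ..., x_n) up to rounding."""
    scaled = [a * x for x in xs]
    return abs(delta(scaled) / a - delta(xs)) <= tol

def shifted_mean(xs):
    # Adds a fixed constant, so the estimate depends on the unit of measurement.
    return mean(xs) + 1.0

feet = [5.2, 6.1, 5.8, 5.5]   # hypothetical observations expressed in feet
a = 12                        # conversion factor from feet to inches

results = {
    "mean": satisfies_scale_condition(mean, feet, a),
    "median": satisfies_scale_condition(median, feet, a),
    "shifted_mean": satisfies_scale_condition(shifted_mean, feet, a),
}
print(results)
```

The sample mean and median pass the check, while the estimator that adds a fixed constant does not, since the constant is not rescaled with the data.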

The general mathematical expression of symmetry is invariance under a suitable
group of transformations. A group $G$ of transformations $g$ of the sample
space is said to leave a statistical decision problem invariant if it satisfies the
following conditions:

(i) It leaves invariant the family of distributions $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$, that is, for
any possible distribution $P_\theta$ of $X$ the distribution of $gX$, say $P_{\theta'}$, is also in
$\mathcal{P}$. The resulting mapping $\theta' = \bar{g}\theta$ of $\Omega$ is assumed to be onto 3 $\Omega$ and 1:1.

(ii) To each $g \in G$, there corresponds a transformation $g^* = h(g)$ of the decision
space $D$ onto itself such that $h$ is a homomorphism, that is, satisfies the
relation $h(g_1 g_2) = h(g_1)h(g_2)$, and the loss function $L$ is unchanged under
the transformation, so that

$$L(\bar{g}\theta, g^*d) = L(\theta, d).$$

Under these assumptions the transformed problem, in terms of $X' = gX$, $\theta' = \bar{g}\theta$,
and $d' = g^*d$, is formally identical with the original problem in terms of
$X$, $\theta$, and $d$. Given a decision procedure $\delta$ for the latter, this is therefore still
appropriate after the transformation. Interpreting the transformation as a change
of coordinate system and hence of the names of the elements, one would, on
observing $x'$, select the decision which in the new system has the name $\delta(x')$,
so that its old name is $g^{*-1}\delta(x')$. If the decision taken is to be independent of
the particular coordinate system adopted, this should coincide with the original
decision $\delta(x)$, that is, the procedure must satisfy the invariance condition

$$\delta(gx) = g^*\delta(x) \quad \text{for all } x \in \mathcal{X},\ g \in G. \tag{1.8}$$

3 The term onto is used to indicate that $\bar{g}\Omega$ is not only contained in but actually
equals $\Omega$; that is, given any $\theta'$ in $\Omega$, there exists $\theta$ in $\Omega$ such that $\bar{g}\theta = \theta'$.

Example 1.5.3 The model described in Example 1.5.1 is invariant also under
the transformations $X_{ij}' = X_{ij} + c$ ($i = 1, 2$; $j = 1, \ldots, n$). Since the decisions $d_0$, $d_1$, and $d_2$
concern only the differences $\xi_2 - \xi_1$, they should remain unchanged under these
transformations, so that one would expect to have $g^*d_i = d_i$ for $i = 0, 1, 2$. It is in
fact easily seen that the loss function does satisfy $L(\bar{g}\theta, d) = L(\theta, d)$, and hence
that $g^*d = d$. A decision procedure therefore remains invariant in the present
case if it satisfies $\delta(gx) = \delta(x)$ for all $g \in G$, $x \in \mathcal{X}$. ■
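The two symmetry requirements for the two-sample problem, invariance under a common shift and exchange of $d_1$ and $d_2$ when the samples are relabeled, can be illustrated as follows. This is only a sketch: the rule `delta`, its `cutoff`, the data, and the reading of $d_1$ as "second mean larger" are assumptions made for the illustration.

```python
from statistics import mean

def delta(x1, x2, cutoff=0.5):
    """Hypothetical rule based only on the difference of the sample means."""
    diff = mean(x2) - mean(x1)
    if diff > cutoff:
        return "d1"   # second treatment judged to have the larger mean
    if diff < -cutoff:
        return "d2"   # first treatment judged to have the larger mean
    return "d0"       # no material difference

# Effect g* of exchanging the labels 1 and 2 on the decisions.
SWAP = {"d0": "d0", "d1": "d2", "d2": "d1"}

x1 = [4.0, 5.0, 6.0]
x2 = [7.0, 8.0, 9.0]
c = 10.0

# Common shift: the decision should not change (g*d = d).
invariant_under_shift = delta([x + c for x in x1], [x + c for x in x2]) == delta(x1, x2)

# Relabeling: the decision should change according to g* (d1 <-> d2).
symmetric_under_relabeling = delta(x2, x1) == SWAP[delta(x1, x2)]

print(invariant_under_shift, symmetric_under_relabeling)
```

A rule that depended on, say, the first sample mean alone would fail the shift check, since adding $c$ would change its value.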

It is helpful to make a terminological distinction between situations like that 
of Example 1.5.3 in which g*d = d for all d, and those like Examples 1.5.1 
and 1.5.2 where invariance considerations require $\delta(gx)$ to vary with $g$. In the
former case the decision procedure remains unchanged under the transformations 
X' = gX and is thus truly invariant; in the latter, the procedure varies with g 
and may then more appropriately be called equivariant rather than invariant. 
Typically, hypothesis testing leads to procedures that are invariant in this sense; 
estimation problems (whether by point or interval estimation), to equivariant 
ones. Invariant tests and equivariant confidence sets will be discussed in Chapter 
6. For a brief discussion of equivariant point estimation, see Bondesson (1983); a
fuller treatment is given in TPE2, Chapter 3. 

Invariance considerations are applicable only when a problem exhibits certain 
symmetries. An alternative impartiality restriction which is applicable to other 
types of problems is the following condition of unbiasedness. Suppose the problem
is such that for each $\theta$ there exists a unique correct decision and that each decision
is correct for some $\theta$. Assume further that $L(\theta_1, d) = L(\theta_2, d)$ for all $d$ whenever
the same decision is correct for both $\theta_1$ and $\theta_2$. Then the loss $L(\theta, d')$ depends
only on the actual decision taken, say $d'$, and the correct decision $d$. The loss can
thus be denoted by $L(d, d')$ and this function measures how far apart $d$ and $d'$
are. Under these assumptions a decision function $\delta$ is said to be unbiased with
respect to the loss function $L$, or $L$-unbiased, if for all $\theta$ and $d'$

$$E_\theta L(d', \delta(X)) \ge E_\theta L(d, \delta(X))$$

where the subscript $\theta$ indicates the distribution with respect to which the
expectation is taken and where $d$ is the decision that is correct for $\theta$. Thus $\delta$ is
unbiased if on the average $\delta(X)$ comes closer to the correct decision than to any
wrong one. Extending this definition, $\delta$ is said to be $L$-unbiased for an arbitrary
decision problem if for all $\theta$ and $\theta'$

$$E_\theta L(\theta', \delta(X)) \ge E_\theta L(\theta, \delta(X)). \tag{1.9}$$


Example 1.5.4 Suppose that in the problem of estimating a real-valued parameter
$\theta$ by confidence intervals, as in Example 1.2.4, the loss is 0 or 1 as the interval
$[\underline{L}, \bar{L}]$ does or does not cover the true $\theta$. Then the set of intervals $[\underline{L}(X), \bar{L}(X)]$
is unbiased if the probability of covering the true value is greater than or equal
to the probability of covering any false value. ■


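Example 1.5.5 In a two-decision problem such as that of Example 1.2.1(i), let
$\omega_0$ and $\omega_1$ be the sets of $\theta$-values for which $d_0$ and $d_1$ are the correct decisions.
Assume that the loss is 0 when the correct decision is taken, and otherwise is
given by $L(\theta, d_0) = a$ for $\theta \in \omega_1$, and $L(\theta, d_1) = b$ for $\theta \in \omega_0$. Then

$$E_\theta L(\theta', \delta(X)) = \begin{cases} a P_\theta\{\delta(X) = d_0\} & \text{if } \theta' \in \omega_1, \\ b P_\theta\{\delta(X) = d_1\} & \text{if } \theta' \in \omega_0, \end{cases}$$

so that (1.9) reduces to

$$a P_\theta\{\delta(X) = d_0\} \ge b P_\theta\{\delta(X) = d_1\} \quad \text{for } \theta \in \omega_0,$$

with the reverse inequality holding for $\theta \in \omega_1$. Since $P_\theta\{\delta(X) = d_0\} + P_\theta\{\delta(X) = d_1\} = 1$,
the unbiasedness condition (1.9) becomes

$$P_\theta\{\delta(X) = d_1\} \le \frac{a}{a+b} \ \text{ for } \theta \in \omega_0, \qquad P_\theta\{\delta(X) = d_1\} \ge \frac{a}{a+b} \ \text{ for } \theta \in \omega_1. \tag{1.10}$$ ■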
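The condition of Example 1.5.5 can be checked for a concrete rule. The sketch below assumes (this setting is not part of the example) that $X_1, \ldots, X_n$ are independent $N(\theta, 1)$, that $\omega_0 = \{\theta \le 0\}$ and $\omega_1 = \{\theta > 0\}$, and that $d_1$ is taken when $\sqrt{n}\,\bar{X}$ exceeds a cutoff chosen so that the probability of $d_1$ at $\theta = 0$ equals $a/(a+b)$. The loss values and the grids of $\theta$-values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def prob_d1(theta, n, level):
    """P_theta{delta(X) = d1} for the rule 'take d1 when sqrt(n)*mean(X) > cutoff'.

    Under the assumed model sqrt(n)*mean(X) is N(theta*sqrt(n), 1).
    """
    cutoff = STD_NORMAL.inv_cdf(1 - level)
    return 1 - STD_NORMAL.cdf(cutoff - theta * n ** 0.5)

a, b = 1.0, 19.0                  # hypothetical losses for the two kinds of error
bound = a / (a + b)               # the constant a/(a+b) of the example
n = 25
omega0 = [-1.0, -0.5, -0.1, 0.0]  # sample of theta-values for which d0 is correct
omega1 = [0.1, 0.5, 1.0]          # sample of theta-values for which d1 is correct

tol = 1e-9  # allowance for floating-point rounding at the boundary theta = 0
is_unbiased = (all(prob_d1(t, n, bound) <= bound + tol for t in omega0)
               and all(prob_d1(t, n, bound) >= bound - tol for t in omega1))
print(bound, is_unbiased)
```

The check covers only the listed $\theta$-values; here the probability of $d_1$ is increasing in $\theta$, so the grid is representative.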

Example 1.5.6 In the problem of estimating a real-valued function $\gamma(\theta)$ with
the square of the error as loss, the condition of unbiasedness becomes

$$E_\theta[\delta(X) - \gamma(\theta')]^2 \ge E_\theta[\delta(X) - \gamma(\theta)]^2 \quad \text{for all } \theta, \theta'.$$

On adding and subtracting $h(\theta) = E_\theta\delta(X)$ inside the brackets on both sides, this
reduces to

$$[h(\theta) - \gamma(\theta')]^2 \ge [h(\theta) - \gamma(\theta)]^2 \quad \text{for all } \theta, \theta'.$$

If $h(\theta)$ is one of the possible values of the function $\gamma$, this condition holds if and
only if

$$E_\theta\delta(X) = \gamma(\theta). \tag{1.11}$$ ■

In the theory of point estimation, (1.11) is customarily taken as the definition of
unbiasedness. Except under rather pathological conditions, it is both a necessary
and sufficient condition for $\delta$ to satisfy (1.9). (See Problem 1.2.)
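The reduction in Example 1.5.6 rests on the usual decomposition of mean squared error. Written out, under the added assumption that $\delta(X)$ has finite variance, the step is:

```latex
\begin{align*}
E_\theta[\delta(X) - \gamma(\theta')]^2
  &= E_\theta[\delta(X) - h(\theta)]^2
   + 2\,[h(\theta) - \gamma(\theta')]\,E_\theta[\delta(X) - h(\theta)]
   + [h(\theta) - \gamma(\theta')]^2 \\
  &= \operatorname{Var}_\theta \delta(X) + [h(\theta) - \gamma(\theta')]^2 ,
\end{align*}
% The cross term vanishes because E_\theta[\delta(X) - h(\theta)] = 0.
% The same identity holds with \theta' replaced by \theta, and the common
% term Var_\theta \delta(X) cancels from both sides of the inequality.
```

If $h(\theta)$ is a possible value of $\gamma$, choosing $\theta'$ with $\gamma(\theta') = h(\theta)$ makes the left-hand side of the reduced inequality zero, which forces $h(\theta) = \gamma(\theta)$.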


1.6 Bayes and Minimax Procedures 

We now turn to a discussion of some preference orderings of decision procedures
and their risk functions. One such ordering is obtained by assuming that in repeated
experiments the parameter itself is a random variable $\Theta$, the distribution
of which is known. If for the sake of simplicity one supposes that this distribution
has a probability density $\rho(\theta)$, the overall average loss resulting from the use of
a decision procedure $\delta$ is

$$r(\rho, \delta) = \int E_\theta L(\theta, \delta(X))\,\rho(\theta)\,d\theta = \int R(\theta, \delta)\,\rho(\theta)\,d\theta \tag{1.12}$$

and the smaller $r(\rho, \delta)$, the better is $\delta$. An optimum procedure is one that
minimizes $r(\rho, \delta)$, and is called a Bayes solution of the given decision problem
corresponding to a priori density $\rho$. The resulting minimum of $r(\rho, \delta)$ is called
the Bayes risk of $\delta$.

Unfortunately, in order to apply this principle it is necessary to assume not
only that $\theta$ is a random variable but also that its distribution is known. This
assumption is usually not warranted in applications. Alternatively, the right-hand
side of (1.12) can be considered as a weighted average of the risks; for $\rho(\theta) = 1$ in
particular, it is then the area under the risk curve. With this interpretation the
choice of a weight function $\rho$ expresses the importance the experimenter attaches
to the various values of $\theta$. A systematic Bayes theory has been developed which
interprets $\rho$ as describing the state of mind of the investigator towards $\theta$. For an
account of this approach see, for example, Berger (1985a) and Robert (1994).

If no prior information regarding $\theta$ is available, one might consider the maximum
of the risk function its most important feature. Of two risk functions the
one with the smaller maximum is then preferable, and the optimum procedures
are those with the minimax property of minimizing the maximum risk. Since
this maximum represents the worst (average) loss that can result from the use
of a given procedure, a minimax solution is one that gives the greatest possible
protection against large losses. That such a principle may sometimes be quite
unreasonable is indicated in Figure 1.2, where under most circumstances one would
prefer $\delta_1$ to $\delta_2$ although its risk function has the larger maximum.
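The comparison of two risk functions by their maxima can be made concrete. The sketch below uses a setting not taken from the text: estimating a binomial success probability $p$ from $X \sim b(p, n)$ under squared-error loss, comparing the proportion $X/n$ with the estimator $(X + \sqrt{n}/2)/(n + \sqrt{n})$, which is known to have constant risk. The function names and the choice $n = 16$ are illustrative only.

```python
from math import comb, sqrt

def risk(estimator, n, p):
    """Exact squared-error risk E_p[(estimator(X, n) - p)^2] for X ~ b(p, n)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) * (estimator(x, n) - p)**2
               for x in range(n + 1))

def proportion(x, n):
    return x / n

def shrunk(x, n):
    # Constant-risk estimator; its risk is 1 / (4 * (1 + sqrt(n))**2) for every p.
    return (x + sqrt(n) / 2) / (n + sqrt(n))

n = 16
grid = [i / 100 for i in range(101)]
max_prop = max(risk(proportion, n, p) for p in grid)
max_shrunk = max(risk(shrunk, n, p) for p in grid)

print(max_prop, max_shrunk)                               # maximum risks on the grid
print(risk(proportion, n, 0.05), risk(shrunk, n, 0.05))   # risks at p = 0.05
```

The estimator with the smaller maximum risk is not better everywhere: for $p$ near 0 or 1 the proportion has the smaller risk, so the choice between the two depends on which values of $p$ are considered plausible.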



Figure 1.2. 

Perhaps the most common situation is one intermediate to the two just described.
On the one hand, past experience with the same or similar kind of
experiment is available and provides an indication of what values of $\theta$ to expect;
on the other, this information is neither sufficiently precise nor sufficiently
reliable to warrant the assumptions that the Bayes approach requires. In such
circumstances it seems desirable to make use of the available information without
trusting it to such an extent that catastrophically high risks might result if it is
inaccurate or misleading. To achieve this one can place a bound on the risk and
restrict consideration to decision procedures $\delta$ for which

$$R(\theta, \delta) \le C \quad \text{for all } \theta. \tag{1.13}$$

[Here the constant $C$ will have to be larger than the maximum risk $C_0$ of the minimax
procedure, since otherwise there will exist no procedures satisfying (1.13).]
Having thus assured that the risk can under no circumstances get out of hand,
the experimenter can now safely exploit his knowledge of the situation, which
may be based on theoretical considerations as well as on past experience; he can
follow his hunches and guess at a distribution $\rho$ for $\theta$. This leads to the selection
of a procedure $\delta$ (a restricted Bayes solution), which minimizes the average risk
(1.12) for this a priori distribution subject to (1.13). The more certain one is of
$\rho$, the larger one will select $C$, thereby running a greater risk in case of a poor
guess but improving the risk if the guess is good.

Instead of specifying an ordering directly, one can postulate conditions that the
ordering should satisfy. Various systems of such conditions have been investigated
and have generally led to the conclusion that the only orderings satisfying these
systems are those which order the procedures according to their Bayes risk with
respect to some prior distribution of $\theta$. For details, see for example Blackwell and
Girshick (1954), Ferguson (1967), Savage (1972), Berger (1985a), and Bernardo
and Smith (1994).


1.7 Maximum Likelihood 


Another approach, which is based on considerations somewhat different from
those of the preceding sections, is the method of maximum likelihood. It has
led to reasonable procedures in a great variety of problems, and is still playing
a dominant role in the development of new tests and estimates. Suppose for
a moment that $X$ can take on only a countable set of values $x_1, x_2, \ldots$, with
$P_\theta(x) = P_\theta\{X = x\}$, and that one wishes to determine the correct value of $\theta$,
that is, the value that produced the observed $x$. This suggests considering for
each possible $\theta$ how probable the observed $x$ would be if $\theta$ were the true value.
The higher this probability, the more one is attracted to the explanation that the
$\theta$ in question produced $x$, and the more likely the value of $\theta$ appears. Therefore,
the expression $P_\theta(x)$ considered for fixed $x$ as a function of $\theta$ has been called
the likelihood of $\theta$. To indicate the change in point of view, let it be denoted
by $L_x(\theta)$. Suppose now that one is concerned with an action problem involving
a countable number of decisions, and that it is formulated in terms of a gain
function (instead of the usual loss function), which is 0 if the decision taken is
incorrect and is $a(\theta) > 0$ if the decision taken is correct and $\theta$ is the true value.
Then it seems natural to weight the likelihood $L_x(\theta)$ by the amount that can
be gained if $\theta$ is true, to determine the value of $\theta$ that maximizes $a(\theta)L_x(\theta)$
and to select the decision that would be correct if this were the true value of $\theta$.
Essentially the same remarks apply in the case in which $P_\theta(x)$ is a probability
density rather than a discrete probability.

In problems of point estimation, one usually assumes that $a(\theta)$ is independent
of $\theta$. This leads to estimating $\theta$ by the value that maximizes the likelihood $L_x(\theta)$,
the maximum-likelihood estimate of $\theta$. Another case of interest is the class of
two-decision problems illustrated by Example 1.2.1(i). Let $\omega_0$ and $\omega_1$ denote the
sets of $\theta$-values for which $d_0$ and $d_1$ are the correct decisions, and assume that
$a(\theta) = a_0$ or $a_1$ as $\theta$ belongs to $\omega_0$ or $\omega_1$ respectively. Then decision $d_0$ or $d_1$ is
taken as $a_1 \sup_{\theta \in \omega_1} L_x(\theta) <$ or $> a_0 \sup_{\theta \in \omega_0} L_x(\theta)$, that is as

$$\frac{\sup_{\theta \in \omega_0} L_x(\theta)}{\sup_{\theta \in \omega_1} L_x(\theta)} > \text{ or } < \frac{a_1}{a_0}. \tag{1.14}$$

This is known as a likelihood ratio procedure. 4

4 This definition differs slightly from the usual one where in the denominator on the
left-hand side of (1.14) the supremum is taken over the set $\omega_0 \cup \omega_1$. The two definitions
agree whenever the left-hand side of (1.14) is $\le 1$, and the procedures therefore agree if
$a_1 < a_0$.

Although the maximum likelihood principle is not based on any clearly defined
optimum considerations, it has been very successful in leading to satisfactory
procedures in many specific problems. For wide classes of problems, maximum
likelihood procedures will be shown in Chapter 13 to possess various asymptotic
optimum properties as the sample size tends to infinity; also see TPE2, Chapter 6.
On the other hand, there exist examples for which the maximum-likelihood
procedure is worse than useless; where it is, in fact, so bad that one can do better
without making any use of the observations (see Problem 6.28).
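The likelihood ratio procedure (1.14) can be sketched for a simple case. The setting below is an assumption chosen for illustration: $X$ is binomial $b(p, n)$, $\omega_0 = \{p \le 1/2\}$, $\omega_1 = \{p > 1/2\}$, and each supremum is approximated by a maximum over a finite grid. Since (1.14) leaves ties unspecified, the sketch arbitrarily takes $d_1$ in that case.

```python
from math import comb

def likelihood(p, x, n):
    """L_x(p): binomial probability of the observed x, viewed as a function of p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def lr_decision(x, n, a0, a1, omega0, omega1):
    """Take d0 when sup_{omega0} L_x / sup_{omega1} L_x exceeds a1/a0, else d1.

    The comparison is written as a0*sup0 > a1*sup1 to avoid dividing by zero.
    """
    sup0 = max(likelihood(p, x, n) for p in omega0)
    sup1 = max(likelihood(p, x, n) for p in omega1)
    return "d0" if a0 * sup0 > a1 * sup1 else "d1"

n = 10
omega0 = [i / 100 for i in range(0, 51)]     # grid approximation of p <= 1/2
omega1 = [i / 100 for i in range(51, 101)]   # grid approximation of p > 1/2

decisions_equal_gain = [lr_decision(x, n, 1.0, 1.0, omega0, omega1) for x in range(n + 1)]
decisions_d1_favored = [lr_decision(x, n, 1.0, 2.0, omega0, omega1) for x in range(n + 1)]
print(decisions_equal_gain)
print(decisions_d1_favored)
```

Raising $a_1$ relative to $a_0$ lowers the number of successes needed before $d_1$ is taken, which reflects the weighting of the likelihood by the possible gain.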


1.8 Complete Classes 

None of the approaches described so far is reliable in the sense that the resulting
procedure is necessarily satisfactory. There are problems in which a decision
procedure $\delta_0$ exists with uniformly minimum risk among all unbiased or invariant
procedures, but where there exists a procedure $\delta_1$ not possessing this particular
impartiality property and preferable to $\delta_0$. (Cf. Problems 1.14 and 1.16.) As was
seen earlier, minimax procedures can also be quite undesirable, while the success
of Bayes and restricted Bayes solutions depends on a priori information which
is usually not very reliable if it is available at all. In fact, it seems that in the
absence of reliable a priori information no principle leading to a unique solution
can be entirely satisfactory.

This suggests the possibility, at least as a first step, of not insisting on a unique
solution but asking only how far a decision problem can be reduced without loss
of relevant information. It has already been seen that a decision procedure $\delta$ can
sometimes be eliminated from consideration because there exists a procedure $\delta'$
dominating it in the sense that

$$\begin{aligned} R(\theta, \delta') &\le R(\theta, \delta) \quad \text{for all } \theta, \\ R(\theta, \delta') &< R(\theta, \delta) \quad \text{for some } \theta. \end{aligned} \tag{1.15}$$

In this case $\delta$ is said to be inadmissible; $\delta$ is called admissible if no such dominating
$\delta'$ exists. A class $\mathcal{C}$ of decision procedures is said to be complete if for any $\delta$ not
in $\mathcal{C}$ there exists $\delta'$ in $\mathcal{C}$ dominating it. A complete class is minimal if it does not
contain a complete subclass. If a minimal complete class exists, as is typically
the case, it consists exactly of the totality of admissible procedures.
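The dominance relation between two procedures can be checked on a finite grid of parameter values once their risk functions are available. The sketch below is illustrative only: the setting (estimating the mean $\theta$ of $N(\theta, 1)$ from $n$ observations under squared-error loss) and all function names are assumptions, and a grid check can support but not prove dominance over the whole parameter space.

```python
def dominates(risk_a, risk_b, thetas, tol=1e-12):
    """True if risk_a <= risk_b at every grid point and risk_a < risk_b at some point."""
    pairs = [(risk_a(t), risk_b(t)) for t in thetas]
    never_worse = all(ra <= rb + tol for ra, rb in pairs)
    somewhere_better = any(ra < rb - tol for ra, rb in pairs)
    return never_worse and somewhere_better

n = 4
thetas = [i / 4 for i in range(-8, 9)]   # grid from -2 to 2

def risk_sample_mean(theta):
    return 1 / n          # variance of the mean of n observations

def risk_first_observation(theta):
    return 1.0            # variance of a single observation

def risk_constant_zero(theta):
    return theta ** 2     # squared error of always estimating 0

print(dominates(risk_sample_mean, risk_first_observation, thetas))   # mean vs. single observation
print(dominates(risk_sample_mean, risk_constant_zero, thetas))       # mean vs. constant
print(dominates(risk_constant_zero, risk_sample_mean, thetas))       # constant vs. mean
```

The last two comparisons show a pair of procedures neither of which dominates the other: each is better for some values of $\theta$.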

It is convenient to define also the following variant of the complete class notion.
A class $\mathcal{C}$ is said to be essentially complete if for any procedure $\delta$ there exists
$\delta'$ in $\mathcal{C}$ such that $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$. Clearly, any complete class is also
essentially complete. In fact, the two definitions differ only in their treatment of
equivalent decision rules, that is, decision rules with identical risk function. If $\delta$
belongs to the minimal complete class $\mathcal{C}$, any equivalent decision rule must also
belong to $\mathcal{C}$. On the other hand, a minimal essentially complete class need contain
only one member from such a set of equivalent procedures.

In a certain sense a minimal essentially complete class provides the maximum
possible reduction of a decision problem. On the one hand, there is no reason
to consider any of the procedures that have been weeded out. For each of them,
there is included one in $\mathcal{C}$ that is as good or better. On the other hand, it is not
possible to reduce the class further. Given any two procedures in $\mathcal{C}$, each of them
is better in places than the other, so that without additional information it is not
known which of the two is preferable.

The primary concern in statistics has been with the explicit determination of 
procedures, or classes of procedures, for various specific decision problems. Those 
studied most extensively have been estimation problems, and problems involving 
a choice between only two decisions (hypothesis testing), the theory of which 
constitutes the subject of the present volume. However, certain conclusions are 
possible without such specialization. In particular, two results concerning the 
structure of complete classes and minimax procedures have been proved to hold 
under very general assumptions. 5 

(i) The totality of Bayes solutions and limits of Bayes solutions constitute a 
complete class. 

(ii) Minimax procedures are Bayes solutions with respect to a least favorable a
priori distribution, that is, an a priori distribution that maximizes the associated
Bayes risk, and the minimax risk equals this maximum Bayes risk.
Somewhat more generally, if there exists no least favorable a priori distribution
but only a sequence for which the Bayes risk tends to the maximum, the
minimax procedures are limits of the associated sequence of Bayes solutions.


1.9 Sufficient Statistics 

A minimal complete class was seen in the preceding section to provide the
maximum possible reduction of a decision problem without loss of information.
Frequently it is possible to obtain a less extensive reduction of the data, which
applies simultaneously to all problems relating to a given class $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$
of distributions of the given random variable $X$. It consists essentially in discarding
that part of the data which contains no information regarding the unknown
distribution $P_\theta$, and which is therefore of no value for any decision problem
concerning $\theta$.


Example 1.9.1 Trials are performed with constant unknown probability $p$ of
success. If $X_i$ is 1 or 0 as the $i$th trial is a success or failure, the sample
$(X_1, \ldots, X_n)$ shows how many successes there were and in which trials they
occurred. The second of these pieces of information contains no evidence as to
the value of $p$. Once the total number of successes $\sum X_i$ is known to be equal to
$t$, each of the $\binom{n}{t}$ possible positions of these successes is equally likely regardless
of $p$. It follows that knowing $\sum X_i$ but neither the individual $X_i$ nor $p$, one can,
from a table of random numbers, construct a set of random variables $X_1', \ldots, X_n'$
whose joint distribution is the same as that of $X_1, \ldots, X_n$. Therefore, the information
contained in the $X_i$ is the same as that contained in $\sum X_i$ and a table of
random numbers. ■
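The claim of Example 1.9.1 can be verified by direct enumeration for a small number of trials. The following sketch (function names and the values of $n$, $t$, $p$ are chosen for illustration) computes the conditional distribution of the sample given the number of successes, and then rebuilds a sample from $t$ and a random-number source alone.

```python
import random
from itertools import product
from math import comb

def conditional_given_total(p, n, t):
    """P{X = x | sum X_i = t} for every 0/1 sequence x of length n with t successes."""
    joint = {x: p**sum(x) * (1 - p)**(n - sum(x))
             for x in product((0, 1), repeat=n) if sum(x) == t}
    total = sum(joint.values())
    return {x: prob / total for x, prob in joint.items()}

def rebuild_sample(n, t, rng):
    """Place t successes at uniformly chosen positions; uses t and randomness only."""
    ones = set(rng.sample(range(n), t))
    return [1 if i in ones else 0 for i in range(n)]

n, t = 5, 2
for p in (0.1, 0.5, 0.9):
    cond = conditional_given_total(p, n, t)
    # Every arrangement has conditional probability 1 / C(n, t), whatever p is.
    print(p, len(cond), min(cond.values()), max(cond.values()))

print(rebuild_sample(n, t, random.Random(0)))
```

Because the conditional probabilities do not involve $p$, the rebuilt sample has the same joint distribution as the original one when $t$ is the observed number of successes.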


5 Precise statements and proofs of these results are given in the book by Wald (1950).
See also Ferguson (1967) and Berger (1985a). Additional results and references are given
in Brown and Marden (1989) and Kowalski (1995).

Example 1.9.2 If $X_1, \ldots, X_n$ are independently normally distributed with zero
mean and variance $\sigma^2$, the conditional distribution of the sample point over each
of the spheres, $\sum X_i^2 = $ constant, is uniform irrespective of $\sigma^2$. One can therefore
construct an equivalent sample $X_1', \ldots, X_n'$ from a knowledge of $\sum X_i^2$ and a
mechanism that can produce a point randomly distributed over a sphere. ■

More generally, a statistic $T$ is said to be sufficient for the family $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$
(or sufficient for $\theta$, if it is clear from the context what set $\Omega$ is being considered)
if the conditional distribution of $X$ given $T = t$ is independent of $\theta$. As in the two
examples it then follows under mild assumptions 6 that it is not necessary to utilize
the original observations $X$. If one is permitted to observe only $T$ instead of $X$,
this does not restrict the class of available decision procedures. For any value $t$ of
$T$ let $X_t$ be a random variable possessing the conditional distribution of $X$ given $t$.
Such a variable can, at least theoretically, be constructed by means of a suitable
random mechanism. If one then observes $T$ to be $t$ and $X_t$ to be $x'$, the random
variable $X'$ defined through this two-stage process has the same distribution as
$X$. Thus, given any procedure based on $X$, it is possible to construct an equivalent
one based on $X'$ which can be viewed as a randomized procedure based solely
on $T$. Hence if randomization is permitted (and we shall assume throughout that
this is the case), there is no loss of generality in restricting consideration to a
sufficient statistic.

It is inconvenient to have to compute the conditional distribution of X given 
t in order to determine whether or not T is sufficient. A simple check is provided 
by the following factorization criterion. 

Consider first the case that $X$ is discrete, and let $P_\theta(x) = P_\theta\{X = x\}$. Then a
necessary and sufficient condition for $T$ to be sufficient for $\theta$ is that there exists
a factorization

$$P_\theta(x) = g_\theta[T(x)]\,h(x), \tag{1.16}$$

where the first factor may depend on $\theta$ but depends on $x$ only through $T(x)$,
while the second factor is independent of $\theta$.

Suppose that (1.16) holds, and let $T(x) = t$. Then $P_\theta\{T = t\} = \sum P_\theta(x')$
summed over all points $x'$ with $T(x') = t$, and the conditional probability

$$P_\theta\{X = x \mid T = t\} = \frac{P_\theta(x)}{P_\theta\{T = t\}} = \frac{h(x)}{\sum h(x')}$$

is independent of $\theta$. Conversely, if this conditional distribution does not depend
on $\theta$ and is equal to, say $k(x, t)$, then $P_\theta(x) = P_\theta\{T = t\}\,k(x, t)$, so that (1.16)
holds.
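A factorization of the form (1.16) can be checked numerically. The sketch below takes independent Poisson counts with common mean `tau` as an illustrative case, with $T(x) = \sum x_i$; the function names `g` and `h` mirror the two factors in (1.16), and the sample values are hypothetical.

```python
from math import exp, factorial, isclose, prod

def joint_pmf(tau, xs):
    """Joint probability of independent Poisson(tau) counts xs."""
    return prod(tau**x * exp(-tau) / factorial(x) for x in xs)

def g(tau, t, n):
    """Factor that may depend on tau but involves the data only through t = sum(xs)."""
    return tau**t * exp(-n * tau)

def h(xs):
    """Factor that does not involve tau."""
    return 1 / prod(factorial(x) for x in xs)

xs = (2, 0, 3)
ys = (1, 1, 3)   # a different sample with the same total, 5

for tau in (0.5, 2.0, 7.0):
    factorizes = isclose(joint_pmf(tau, xs), g(tau, sum(xs), len(xs)) * h(xs))
    # For samples with equal totals the ratio of probabilities is h(xs)/h(ys),
    # which is free of tau.
    ratio = joint_pmf(tau, xs) / joint_pmf(tau, ys)
    print(tau, factorizes, ratio)
```

The ratio printed in the last column is the same for every `tau`, which is the numerical counterpart of the conditional distribution given $T$ not depending on the parameter.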


Example 1.9.3 Let $X_1, \ldots, X_n$ be independently and identically distributed
according to the Poisson distribution (1.2). Then

$$P_\tau(x_1, \ldots, x_n) = \frac{\tau^{\sum x_i} e^{-n\tau}}{\prod_{j=1}^{n} x_j!},$$

and it follows that $\sum X_i$ is a sufficient statistic for $\tau$. ■

6 These are connected with difficulties concerning the behavior of conditional
probabilities. For a discussion of these difficulties see Sections 2.3-2.5.


In the case that the distribution of $X$ is continuous and has probability density
$p_\theta^X(x)$, let $X$ and $T$ be vector-valued, $X = (X_1, \ldots, X_n)$ and $T = (T_1, \ldots, T_r)$ say.
Suppose that there exist functions $Y = (Y_1, \ldots, Y_{n-r})$ on the sample space such
that the transformation

$$(x_1, \ldots, x_n) \leftrightarrow (T_1(x), \ldots, T_r(x), Y_1(x), \ldots, Y_{n-r}(x)) \tag{1.17}$$

is 1:1 on a suitable domain, and that the joint density of $T$ and $Y$ exists and is
related to that of $X$ by the usual formula

$$p_\theta^X(x) = p_\theta^{T,Y}(T(x), Y(x)) \cdot |J|, \tag{1.18}$$

where $J$ is the Jacobian of $(T_1, \ldots, T_r, Y_1, \ldots, Y_{n-r})$ with respect to $(x_1, \ldots, x_n)$.
Thus in Example 1.9.2, $T = \sqrt{\sum X_i^2}$, and $Y_1, \ldots, Y_{n-1}$ can be taken to be the polar
coordinates of the sample point. From the joint density $p_\theta^{T,Y}(t, y)$ of $T$ and $Y$,
the conditional density of $Y$ given $T = t$ is obtained as

$$p_\theta^{Y \mid t}(y) = \frac{p_\theta^{T,Y}(t, y)}{\int p_\theta^{T,Y}(t, y')\,dy'} \tag{1.19}$$

provided the denominator is different from zero. Regularity conditions for the
validity of (1.18) are given by Tukey (1958b).

Since in the conditional distribution given $t$ only the $Y$'s vary, $T$ is sufficient
for $\theta$ if the conditional distribution of $Y$ given $t$ is independent of $\theta$. Suppose
that $T$ satisfies (1.19). Then analogously to the discrete case, a necessary and
sufficient condition for $T$ to be sufficient is a factorization of the density of the
form

$$p_\theta^X(x) = g_\theta[T(x)]\,h(x). \tag{1.20}$$

(See Problem 1.19.) The following two examples illustrate the application of the
criterion in this case. In both examples the existence of functions $Y$ satisfying
(1.17)-(1.19) will be assumed but not proved. As will be shown later (Section
2.6), this assumption is actually not needed for the validity of the factorization
criterion.


Example 1.9.4 Let $X_1, \ldots, X_n$ be independently distributed with normal
probability density

$$p_{\xi,\sigma}(x) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum x_i^2 + \frac{\xi}{\sigma^2}\sum x_i - \frac{n}{2\sigma^2}\xi^2\right).$$

Then the factorization criterion shows $(\sum X_i, \sum X_i^2)$ to be sufficient for $(\xi, \sigma)$. ■

Example 1.9.5 Let $X_1, \ldots, X_n$ be independently distributed according to the
uniform distribution $U(0, \theta)$ over the interval $(0, \theta)$. Then $p_\theta(x) = \theta^{-n} u(\max x_i, \theta)$,
where $u(a, b)$ is 1 or 0 as $a \le b$ or $a > b$, and hence $\max X_i$ is sufficient for $\theta$. ■

An alternative criterion of Bayes sufficiency, due to Kolmogorov (1942), provides
a direct connection between this concept and some of the basic notions
of decision theory. As in the theory of Bayes solutions, consider the unknown
parameter $\theta$ as a random variable $\Theta$ with an a priori distribution, and assume
for simplicity that it has a density $\rho(\theta)$. Then if $T$ is sufficient, the conditional
distribution of $\Theta$ given $X = x$ depends only on $T(x)$. Conversely, if $\rho(\theta) \ne 0$ for
all $\theta$ and if the conditional distribution of $\Theta$ given $x$ depends only on $T(x)$, then
$T$ is sufficient for $\theta$.

In fact, under the assumptions made, the joint density of $X$ and $\Theta$ is $p_\theta(x)\rho(\theta)$.
If $T$ is sufficient, it follows from (1.20) that the conditional density of $\Theta$ given
$x$ depends only on $T(x)$. Suppose, on the other hand, that for some a priori
distribution for which $\rho(\theta) \ne 0$ for all $\theta$ the conditional distribution of $\Theta$ given $x$
depends only on $T(x)$. Then

$$\frac{p_\theta(x)\rho(\theta)}{\int p_{\theta'}(x)\rho(\theta')\,d\theta'} = f_\theta[T(x)]$$

and by solving for $p_\theta(x)$ it is seen that $T$ is sufficient.
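The Bayes form of sufficiency can be illustrated with a discrete prior. The sketch below assumes, purely for illustration, independent 0/1 observations with success probability $\theta$, a prior placed on a finite grid of $\theta$-values, and $T(x) = \sum x_i$; the likelihood is computed from the individual observations so that the dependence on $T$ alone is a result rather than an input.

```python
from math import isclose, prod

def likelihood(theta, xs):
    """Probability of the 0/1 sequence xs, computed observation by observation."""
    return prod(theta if x == 1 else 1 - theta for x in xs)

def posterior(xs, thetas, prior):
    """Posterior weights of the grid values thetas given the observed sequence xs."""
    weights = [w * likelihood(th, xs) for th, w in zip(thetas, prior)]
    total = sum(weights)
    return [w / total for w in weights]

thetas = [i / 10 for i in range(1, 10)]
prior = [i / 45 for i in range(1, 10)]   # hypothetical positive prior weights summing to 1

same_total_a = posterior((1, 1, 0, 0, 0), thetas, prior)
same_total_b = posterior((0, 1, 0, 0, 1), thetas, prior)
other_total = posterior((1, 1, 1, 0, 0), thetas, prior)

agree = all(isclose(a, b, rel_tol=1e-9) for a, b in zip(same_total_a, same_total_b))
differ = not all(isclose(a, c, rel_tol=1e-9) for a, c in zip(same_total_a, other_total))
print(agree, differ)
```

Two samples with the same number of successes lead to the same posterior distribution, while a sample with a different total does not.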

Any Bayes solution depends only on the conditional distribution of $\Theta$ given
$x$ (see Problem 1.8) and hence on $T(x)$. Since typically Bayes solutions together
with their limits form an essentially complete class, it follows that this is also 
true of the decision procedures based on T. The same conclusion had already 
been reached more directly at the beginning of the section. 

For a discussion of the relation of these different aspects of sufficiency in more 
general circumstances and references to the literature see Le Cam (1964), Roy 
and Ramamoorthi (1979) and Yamada and Morimoto (1992). An example of a 
statistic which is Bayes sufficient in the Kolmogorov sense but not according to 
the definition given at the beginning of this section is provided by Blackwell and 
Ramamoorthi (1982). 

By restricting attention to a sufficient statistic, one obtains a reduction of
the data, and it is then desirable to carry this reduction as far as possible. To
illustrate the different possibilities, consider once more the binomial Example
1.9.1. If $m$ is any integer less than $n$ and $T_1 = \sum_{i=1}^{m} X_i$, $T_2 = \sum_{i=m+1}^{n} X_i$,
then $(T_1, T_2)$ constitutes a sufficient statistic, since the conditional distribution
of $X_1, \ldots, X_n$ given $T_1 = t_1$, $T_2 = t_2$ is independent of $p$. For the same reason, the
full sample $(X_1, \ldots, X_n)$ itself is also a sufficient statistic. However, $T = \sum_{i=1}^{n} X_i$
provides a more thorough reduction than either of these and than various others
that can be constructed. A sufficient statistic $T$ is said to be minimal sufficient if
the data cannot be reduced beyond $T$ without losing sufficiency. For the binomial
example in particular, $\sum_{i=1}^{n} X_i$ can be shown to be minimal (Problem 1.17). This
illustrates the fact that in specific examples the sufficient statistic determined by
inspection through the factorization criterion usually turns out to be minimal.
Explicit procedures for constructing minimal sufficient statistics are discussed in
Section 1.5 of TPE2.


1.10 Problems 

Section 1.2 

Problem 1.1 The following distributions arise on the basis of assumptions
similar to those leading to (1.1)-(1.3).

(i) Independent trials with constant probability $p$ of success are carried out until
a preassigned number $m$ of successes has been obtained. If the number of trials
required is $X + m$, then $X$ has the negative binomial distribution $Nb(p, m)$:

$$P\{X = x\} = \binom{m + x - 1}{x} p^m (1 - p)^x, \quad x = 0, 1, 2, \ldots.$$

(ii) In a sequence of random events, the number of events occurring in any time
interval of length $\tau$ has the Poisson distribution $P(\lambda\tau)$, and the numbers of events
in nonoverlapping time intervals are independent. Then the "waiting time" $T_1$,
which elapses from the starting point, say $t = 0$, until the first event occurs, has
the exponential probability density

$$p(t) = \lambda e^{-\lambda t}, \quad t \ge 0.$$

Let $T_i$, $i \ge 2$, be the time elapsing from the occurrence of the $(i - 1)$st event
to that of the $i$th event. Then it is also true, although more difficult to prove,
that $T_1, T_2, \ldots$ are identically and independently distributed. A proof is given,
for example, in Karlin and Taylor (1975).

(iii) A point $X$ is selected "at random" in the interval $(a, b)$, that is, the probability
of $X$ falling in any subinterval of $(a, b)$ depends only on the length of the
subinterval, not on its position. Then $X$ has the uniform distribution $U(a, b)$ with
probability density

$$p(x) = 1/(b - a), \quad a < x < b.$$


Section 1.5 

Problem 1.2 Unbiasedness in point estimation. Suppose that $\gamma$ is a continuous
real-valued function defined over $\Omega$ which is not constant in any open subset of
$\Omega$, and that the expectation $h(\theta) = E_\theta\delta(X)$ is a continuous function of $\theta$ for
every estimate $\delta(X)$ of $\gamma(\theta)$. Then (1.11) is a necessary and sufficient condition
for $\delta(X)$ to be unbiased when the loss function is the square of the error.
[Unbiasedness implies that $\gamma^2(\theta') - \gamma^2(\theta) \ge 2h(\theta)[\gamma(\theta') - \gamma(\theta)]$ for all $\theta$, $\theta'$. If $\theta$ is
neither a relative minimum nor maximum of $\gamma$, it follows that there exist points
$\theta'$ arbitrarily close to $\theta$ both such that $\gamma(\theta) + \gamma(\theta') \ge$ and $\le 2h(\theta)$, and hence
that $\gamma(\theta) = h(\theta)$. That this equality also holds for an extremum of $\gamma$ follows by
continuity, since $\gamma$ is not constant in any open set.]

Problem 1.3 Median unbiasedness.

(i) A real number $m$ is a median for the random variable $Y$ if $P\{Y \ge m\} \ge \frac{1}{2}$,
$P\{Y \le m\} \ge \frac{1}{2}$. Then all real $a_1$, $a_2$ such that $m \le a_1 \le a_2$ or $m \ge a_1 \ge a_2$
satisfy $E|Y - a_1| \le E|Y - a_2|$.

(ii) For any estimate $\delta(X)$ of $\gamma(\theta)$, let $m^-(\theta)$ and $m^+(\theta)$ denote the infimum
and supremum of the medians of $\delta(X)$, and suppose that they are continuous
functions of $\theta$. Let $\gamma(\theta)$ be continuous and not constant in any open subset of
$\Omega$. Then the estimate $\delta(X)$ of $\gamma(\theta)$ is unbiased with respect to the loss function
$L(\theta, d) = |\gamma(\theta) - d|$ if and only if $\gamma(\theta)$ is a median of $\delta(X)$ for each $\theta$. An estimate
with this property is said to be median-unbiased.

Problem 1.4 Nonexistence of unbiased procedures. Let $X_1, \ldots, X_n$ be independently
distributed with density $(1/a) f((x - \xi)/a)$, and let $\theta = (\xi, a)$. Then
no estimator of $\xi$ exists which is unbiased with respect to the loss function
$(d - \xi)^k / a^k$. Note. For more general results concerning the nonexistence of
unbiased procedures see Rojo (1983).

Problem 1.5 Let $\mathcal{C}$ be any class of procedures that is closed under the transformations
of a group $G$ in the sense that $\delta \in \mathcal{C}$ implies $g^*\delta g^{-1} \in \mathcal{C}$ for all $g \in G$. If
there exists a unique procedure $\delta_0$ that uniformly minimizes the risk within the
class $\mathcal{C}$, then $\delta_0$ is invariant. 7 If $\delta_0$ is unique only up to sets of measure zero, then
it is almost invariant, that is, for each $g$ it satisfies the equation $\delta(gx) = g^*\delta(x)$
except on a set $N_g$ of measure 0.

Problem 1.6 Relation of unbiasedness and invariance. 

(i) If $\delta_0$ is the unique (up to sets of measure 0) unbiased procedure with uniformly
minimum risk, it is almost invariant.

(ii) If $\bar{G}$ is transitive and $G^*$ commutative, and if among all invariant (almost
invariant) procedures there exists a procedure $\delta_0$ with uniformly minimum risk,
then it is unbiased.

(iii) That conclusion (ii) need not hold without the assumptions concerning $G^*$
and $\bar{G}$ is shown by the problem of estimating the mean $\xi$ of a normal distribution
$N(\xi, \sigma^2)$ with loss function $(\xi - d)^2/\sigma^2$. This remains invariant under the groups
$G_1$: $gx = x + b$, $-\infty < b < \infty$ and $G_2$: $gx = ax + b$, $0 < a < \infty$, $-\infty < b < \infty$.
The best invariant estimate relative to both groups is $\bar{X}$, but there does not exist
an estimate which is unbiased with respect to the given loss function.

[(i): This follows from the preceding problem and the fact that when $\delta$ is unbiased
so is $g^*\delta g^{-1}$.

(ii): It is the defining property of transitivity that given $\theta, \theta'$ there exists $\bar{g}$ such
that $\theta' = \bar{g}\theta$. Hence for any $\theta, \theta'$

$E_\theta L(\theta', \delta_0(X)) = E_\theta L(\bar{g}\theta, \delta_0(X)) = E_\theta L(\theta, g^{*-1}\delta_0(X))$.

Since $G^*$ is commutative, $g^{*-1}\delta_0$ is invariant, so that

$R(\theta, g^{*-1}\delta_0) \ge R(\theta, \delta_0) = E_\theta L(\theta, \delta_0(X))$.]


Section 1.6 

Problem 1.7 Unbiasedness in interval estimation. Confidence intervals $I = (\underline{L}, \bar{L})$
are unbiased for estimating $\theta$ with loss function $L(\theta, I) = (\theta - \underline{L})^2 + (\bar{L} - \theta)^2$
provided $E[\frac12(\underline{L} + \bar{L})] = \theta$ for all $\theta$, that is, provided the midpoint of $I$ is an
unbiased estimate of $\theta$ in the sense of (1.11).

Problem 1.8 Structure of Bayes solutions. 

(i) Let $\Theta$ be an unobservable random quantity with probability density $\rho(\theta)$, and
let the probability density of $X$ be $p_\theta(x)$ when $\Theta = \theta$. Then $\delta$ is a Bayes solution
of a given decision problem if for each $x$ the decision $\delta(x)$ is chosen so as to
minimize $\int L(\theta, \delta(x))\,\pi(\theta \mid x)\,d\theta$, where $\pi(\theta \mid x) = \rho(\theta)p_\theta(x) / \int \rho(\theta')p_{\theta'}(x)\,d\theta'$ is
the conditional (a posteriori) probability density of $\Theta$ given $x$.

$^7$ Here and in Problems 1.6, 1.7, 1.11, 1.15, and 1.16 the term “invariant” is used in
the general sense (1.8) of “invariant or equivariant.”

(ii) Let the problem be a two-decision problem with the losses as given in Example
1.5.5. Then the Bayes solution consists in choosing decision $d_0$ if

$a P\{\Theta \in \omega_1 \mid x\} < b P\{\Theta \in \omega_0 \mid x\}$

and decision $d_1$ if the reverse inequality holds. The choice of decision is immaterial
in case of equality.

(iii) In the case of point estimation of a real-valued function $g(\theta)$ with loss function
$L(\theta, d) = (g(\theta) - d)^2$, the Bayes solution becomes $\delta(x) = E[g(\Theta) \mid x]$. When
instead the loss function is $L(\theta, d) = |g(\theta) - d|$, the Bayes estimate $\delta(x)$ is any
median of the conditional distribution of $g(\Theta)$ given $x$.

[(i): The Bayes risk $r(\rho, \delta)$ can be written as $\int [\int L(\theta, \delta(x))\,\pi(\theta \mid x)\,d\theta]\, p(x)\,dx$,
where $p(x) = \int \rho(\theta')p_{\theta'}(x)\,d\theta'$.

(ii): The conditional expectation $\int L(\theta, d_0)\,\pi(\theta \mid x)\,d\theta$ reduces to $a P\{\Theta \in \omega_1 \mid x\}$,
and similarly for $d_1$.]

Problem 1.9 (i) As an example in which randomization reduces the maximum 
risk, suppose that a coin is known to be either standard (HT) or to have heads on 
both sides (HH). The nature of the coin is to be decided on the basis of a single 
toss, the loss being 1 for an incorrect decision and 0 for a correct one. Let the 
decision be HT when T is observed, whereas in the contrary case the decision is 
made at random, with probability $p$ for HT and $1 - p$ for HH. Then the maximum
risk is minimized for $p = \frac13$.

(ii) A genetic setting in which such a problem might arise is that of a couple, of 
which the husband is either dominant homozygous (AA) or heterozygous (Aa) 
with respect to a certain characteristic, and the wife is homozygous recessive (aa). 
Their child is heterozygous, and it is of importance to determine to which genetic 
type the husband belongs. However, in such cases an a priori probability is usually 
available for the two possibilities. One is then dealing with a Bayes problem, and 
randomization is no longer required. In fact, if the a priori probability is p that 
the husband is dominant, then the Bayes procedure classifies him as such if $p > \frac13$
and takes the contrary decision if $p < \frac13$.
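
The two thresholds in Problem 1.9 can be checked numerically. The following is a minimal sketch, not part of the original text; the function names `risks`, `max_risk`, and `posterior_dominant` are illustrative choices. It assumes 0-1 loss, the decision rule described in (i), and, for (ii), that a heterozygous husband passes on the dominant allele with probability 1/2.

```python
from fractions import Fraction

def risks(p):
    """Return (risk if the coin is HT, risk if the coin is HH) for randomization p."""
    p = Fraction(p)
    risk_ht = Fraction(1, 2) * (1 - p)   # H appears w.p. 1/2, then "HH" is chosen w.p. 1 - p
    risk_hh = p                          # H always appears, then "HT" is chosen w.p. p
    return risk_ht, risk_hh

def max_risk(p):
    return max(risks(p))

def posterior_dominant(prior):
    """P(husband is AA | child is Aa) when the prior probability of AA is `prior`."""
    prior = Fraction(prior)
    return prior / (prior + (1 - prior) * Fraction(1, 2))

# Grid search over rational p in [0, 1]; the grid contains 1/3 exactly.
grid = [Fraction(k, 300) for k in range(301)]
best_p = min(grid, key=max_risk)
```

The two risks are equal exactly at the minimizing `p`, and the posterior probability of the dominant type crosses 1/2 at the same prior value, which is why the same threshold appears in both parts.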

Problem 1.10 Unbiasedness and minimax. Let $\Omega = \Omega_0 \cup \Omega_1$ where $\Omega_0, \Omega_1$
are mutually exclusive, and consider a two-decision problem with loss function

$L(\theta, d_i) = a_i$ for $\theta \in \Omega_j$ ($j \ne i$) and $L(\theta, d_i) = 0$ for $\theta \in \Omega_i$ ($i = 0, 1$).

(i) Any minimax procedure is unbiased. (ii) The converse of (i) holds provided
$P_\theta(A)$ is a continuous function of $\theta$ for all $A$, and if the sets $\Omega_0$ and $\Omega_1$ have at
least one common boundary point.

[(i): The condition of unbiasedness in this case is equivalent to $\sup R_\delta(\theta) \le
a_0a_1/(a_0 + a_1)$. That this is satisfied by any minimax procedure is seen by comparison
with the procedure $\delta(x) = d_0$ or $= d_1$ with probabilities $a_1/(a_0 + a_1)$ and
$a_0/(a_0 + a_1)$ respectively.

(ii): If $\theta_0$ is a common boundary point, continuity of the risk function implies
that any unbiased procedure satisfies $R_\delta(\theta_0) = a_0a_1/(a_0 + a_1)$ and hence
$\sup R_\delta(\theta) = a_0a_1/(a_0 + a_1)$.]

Problem 1.11 Invariance and minimax. Let a problem remain invariant relative
to the groups $G$, $\bar{G}$, and $G^*$ over the spaces $\mathcal{X}$, $\Omega$, and $D$ respectively.
Then a randomized procedure $Y_x$ is defined to be invariant if for all $x$ and $g$ the
conditional distribution of $Y_x$ given $x$ is the same as that of $g^{*-1}Y_{gx}$.

(i) Consider a decision procedure which remains invariant under a finite group
$G = \{g_1, \ldots, g_N\}$. If a minimax procedure exists, then there exists one that
is invariant. (ii) This conclusion does not necessarily hold for infinite groups,
as is shown by the following example. Let the parameter space $\Omega$ consist of
all elements $\theta$ of the free group with two generators, that is, the totality of
formal products $\pi_1 \cdots \pi_n$ ($n = 0, 1, 2, \ldots$) where each $\pi_i$ is one of the elements
$a, a^{-1}, b, b^{-1}$ and in which all products $aa^{-1}$, $a^{-1}a$, $bb^{-1}$, and $b^{-1}b$ have been
canceled. The empty product ($n = 0$) is denoted by $e$. The sample point $X$ is
obtained by multiplying $\theta$ on the right by one of the four elements $a, a^{-1}, b, b^{-1}$
with probability $\frac14$ each, and canceling if necessary, that is, if the random factor
equals $\pi_n^{-1}$. The problem of estimating $\theta$ with $L(\theta, d)$ equal to 0 if $d = \theta$ and equal
to 1 otherwise remains invariant under multiplication of $X$, $\theta$, and $d$ on the left
by an arbitrary sequence $\pi_{-m} \cdots \pi_{-2}\pi_{-1}$ ($m = 0, 1, \ldots$). The invariant procedure
that minimizes the maximum risk has risk function $R(\theta, \delta) \equiv \frac34$. However, there
exists a noninvariant procedure with maximum risk $\frac14$.

[(i): If $Y_x$ is a (possibly randomized) minimax procedure, an invariant minimax
procedure $Y'_x$ is defined by $P(Y'_x = d) = \sum_{i=1}^N P(Y_{g_i x} = g_i^* d)/N$.

(ii): The better procedure consists in estimating $\theta$ to be $\pi_1 \cdots \pi_{k-1}$ when $\pi_1 \cdots \pi_k$
is observed ($k \ge 1$), and estimating $\theta$ to be $a, a^{-1}, b, b^{-1}$ with probability $\frac14$ each in
case the identity is observed. The estimate will be correct unless the last element
of $X$ was canceled, and hence will be correct with probability $\ge \frac34$.]
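
The risks in part (ii) can be computed exactly for any given reduced word. The sketch below is an illustration rather than part of the original problem. It encodes $a^{-1}$ and $b^{-1}$ as the upper-case letters `A` and `B` (an assumed encoding), represents reduced words as strings, and compares the rule from the hint with invariant rules of the form "estimate $Xc$ for a fixed generator $c$"; the helper names are hypothetical.

```python
from fractions import Fraction

INV = {"a": "A", "A": "a", "b": "B", "B": "b"}   # upper case denotes the inverse
GENS = "aAbB"

def times(word, g):
    """Right-multiply a reduced word by the generator g, canceling if necessary."""
    if word and word[-1] == INV[g]:
        return word[:-1]
    return word + g

def drop_last_rule(x):
    """Rule from the hint: distribution of the estimate given the observation x."""
    if x:
        return {x[:-1]: Fraction(1)}
    return {g: Fraction(1, 4) for g in GENS}      # identity observed: guess a generator

def invariant_rule(c):
    """An invariant rule: always estimate theta by X * c for a fixed generator c."""
    return lambda x: {times(x, c): Fraction(1)}

def risk(theta, rule):
    """Exact 0-1 risk at theta: probability that the estimate differs from theta."""
    correct = Fraction(0)
    for g in GENS:                                # each factor has probability 1/4
        x = times(theta, g)
        correct += Fraction(1, 4) * rule(x).get(theta, Fraction(0))
    return 1 - correct
```

For words of length at least two the rule from the hint errs only when the random factor cancels the last letter, so its risk there is exactly 1/4, while each invariant rule of the above form has risk 3/4 at every word.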


Section 1.7 

Problem 1.12 (i) Let $X$ have probability density $p_\theta(x)$ with $\theta$ one of the values
$\theta_1, \ldots, \theta_n$, and consider the problem of determining the correct value of $\theta$, so
that the choice lies between the $n$ decisions $d_1 = \theta_1, \ldots, d_n = \theta_n$ with gain
$a(\theta_i)$ if $d_i = \theta_i$ and 0 otherwise. Then the Bayes solution (which maximizes the
average gain) when $\theta$ is a random variable taking on each of the $n$ values with
probability $1/n$ coincides with the maximum-likelihood procedure. (ii) Let $X$
have probability density $p_\theta(x)$ with $0 \le \theta \le 1$. Then the maximum-likelihood
estimate is the mode (maximum value) of the a posteriori density of $\Theta$ given $x$
when $\Theta$ is uniformly distributed over $(0, 1)$.


Problem 1.13 (i) Let $X_1, \ldots, X_n$ be a sample from $N(\xi, \sigma^2)$, and consider the
problem of deciding between $\omega_0 : \xi < 0$ and $\omega_1 : \xi \ge 0$. If $\bar{x} = \sum x_i/n$ and
$C = (a_1/a_0)^{2/n}$, the likelihood-ratio procedure takes decision $d_0$ or $d_1$ as

$\frac{\sqrt{n}\,\bar{x}}{\sqrt{\sum (x_i - \bar{x})^2}} < k$ or $> k$,

where $k = -\sqrt{C - 1}$ if $C > 1$ and $k = \sqrt{(1 - C)/C}$ if $C < 1$.

(ii) For the problem of deciding between $\omega_0 : \sigma < \sigma_0$ and $\omega_1 : \sigma \ge \sigma_0$ the
likelihood ratio procedure takes decision $d_0$ or $d_1$ as

$\frac{\sum (x_i - \bar{x})^2}{n\sigma_0^2} < k$ or $> k$,

where $k$ is the smaller root of the equation $Cx = e^{x-1}$ if $C > 1$, and the larger
root of $x = Ce^{x-1}$ if $C < 1$, where $C$ is defined as in (i).
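
The cutoffs in Problem 1.13 are easy to evaluate for a given $C$. The following sketch is illustrative only: `k_mean`, `k_variance`, and the plain bisection helper are assumed names, and the bracketing intervals rely on the fact that each equation in (ii) has one root below 1 and one above 1 when $C \ne 1$.

```python
import math

def k_mean(C):
    """Cutoff for sqrt(n) * xbar / sqrt(sum (x_i - xbar)^2) in part (i); requires C > 0."""
    if C > 1:
        return -math.sqrt(C - 1)
    return math.sqrt((1 - C) / C)

def bisect(h, lo, hi, tol=1e-12):
    """Root of h in [lo, hi], assuming h(lo) and h(hi) have opposite signs."""
    sign_lo = h(lo) > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (h(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def k_variance(C):
    """Cutoff for sum (x_i - xbar)^2 / (n sigma_0^2) in part (ii); requires C > 0."""
    if C == 1:
        return 1.0                                   # both equations have the double root 1
    if C > 1:                                        # smaller root of C x = e^(x - 1)
        return bisect(lambda x: C * x - math.exp(x - 1), 0.0, 1.0)
    hi = 2.0                                         # larger root of x = C e^(x - 1)
    while hi - C * math.exp(hi - 1) > 0:
        hi *= 2
    return bisect(lambda x: x - C * math.exp(x - 1), 1.0, hi)
```

For example, `k_variance(2.0)` lies in (0, 1) and `k_variance(0.5)` lies above 1, matching the "smaller root" and "larger root" descriptions.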


Section 1.8 

Problem 1.14 Admissibility of unbiased procedures. 

(i) Under the assumptions of Problem 1.10, if among the unbiased procedures
there exists one with uniformly minimum risk, it is admissible. (ii) That in general
an unbiased procedure with uniformly minimum risk need not be admissible is
seen by the following example. Let $X$ have a Poisson distribution truncated at
0, so that $P_\theta\{X = x\} = \theta^x e^{-\theta}/[x!(1 - e^{-\theta})]$ for $x = 1, 2, \ldots$. For estimating
$\gamma(\theta) = e^{-\theta}$ with loss function $L(\theta, d) = (d - e^{-\theta})^2$, there exists a unique unbiased
estimate, and it is not admissible.

[(ii): The unique unbiased estimate $\delta_0(x) = (-1)^{x+1}$ is dominated by $\delta_1(x) = 0$
or 1 as $x$ is even or odd.]
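
Both claims in the hint can be checked numerically. The sketch below is an illustration under the assumption that truncating the series after 80 terms is accurate enough for moderate $\theta$; the function names are hypothetical.

```python
import math

def pmf(x, theta):
    """P_theta{X = x} for the Poisson distribution truncated at 0 (x = 1, 2, ...)."""
    return theta**x * math.exp(-theta) / (math.factorial(x) * (1 - math.exp(-theta)))

def delta0(x):
    return (-1) ** (x + 1)          # the unbiased estimate: +1 for odd x, -1 for even x

def delta1(x):
    return 1 if x % 2 == 1 else 0   # the dominating estimate: 1 for odd x, 0 for even x

def expectation(fn, theta, terms=80):
    """Truncated series for E_theta fn(X)."""
    return sum(fn(x) * pmf(x, theta) for x in range(1, terms + 1))

def risk(delta, theta, terms=80):
    """Squared-error risk for estimating exp(-theta)."""
    target = math.exp(-theta)
    return expectation(lambda x: (delta(x) - target) ** 2, theta, terms)
```

The two estimates agree for odd $x$; for even $x$ the value 0 is closer to $e^{-\theta} \in (0, 1)$ than $-1$ is, which is the source of the domination.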

Problem 1.15 Admissibility of invariant procedures. If a decision problem
remains invariant under a finite group, and if there exists a procedure $\delta_0$
that uniformly minimizes the risk among all invariant procedures, then $\delta_0$ is
admissible.

[This follows from the identity $R(\theta, \delta) = R(\bar{g}\theta, g^*\delta g^{-1})$ and the hint given in
Problem 1.11(i).]

Problem 1.16 (i) Let $X$ take on the values $\theta - 1$ and $\theta + 1$ with probability
$\frac12$ each. The problem of estimating $\theta$ with loss function $L(\theta, d) = \min(|\theta - d|, 1)$
remains invariant under the transformation $gX = X + c$, $\bar{g}\theta = \theta + c$, $g^*d = d + c$.
Among invariant estimates, those taking on the values $X - 1$ and $X + 1$ with
probabilities $p$ and $q$ (independent of $X$) uniformly minimize the risk. (ii) That the
conclusion of Problem 1.15 need not hold when $G$ is infinite follows by comparing
the best invariant estimates of (i) with the estimate $\delta_1(x)$ which is $X + 1$ when
$X < 0$ and $X - 1$ when $X \ge 0$.
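
The comparison asked for in (ii) can be carried out exactly. This is a small sketch with illustrative function names; it evaluates the risk of the randomized invariant estimate and of $\delta_1$ at a given $\theta$, using exact rational arithmetic.

```python
from fractions import Fraction

def loss(theta, d):
    return min(abs(theta - d), 1)

def risk_invariant(theta, p):
    """Risk of the estimate equal to X - 1 w.p. p and X + 1 w.p. q = 1 - p."""
    q = 1 - p
    total = Fraction(0)
    for x in (theta - 1, theta + 1):              # each value of X has probability 1/2
        total += Fraction(1, 2) * (p * loss(theta, x - 1) + q * loss(theta, x + 1))
    return total

def delta1(x):
    return x + 1 if x < 0 else x - 1

def risk_delta1(theta):
    """Risk of the noninvariant estimate delta1 at theta."""
    return sum(Fraction(1, 2) * loss(theta, delta1(x)) for x in (theta - 1, theta + 1))
```

The invariant estimates have constant risk 1/2 for every $p$, whereas $\delta_1$ has risk 0 for $-1 \le \theta < 1$ and risk 1/2 elsewhere, so $\delta_1$ is at least as good everywhere and strictly better on an interval.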


Section 1.9 

Problem 1.17 In $n$ independent trials with constant probability $p$ of success,
let $X_i = 1$ or 0 as the $i$th trial is a success or not. Then $\sum_{i=1}^n X_i$ is minimal
sufficient.

[Let $T = \sum X_i$ and suppose that $U = f(T)$ is sufficient and that $f(k_1) = \cdots =
f(k_r) = u$. Then $P\{T = t \mid U = u\}$ depends on $p$.]

Problem 1.18 (i) Let $X_1, \ldots, X_n$ be a sample from the uniform distribution
$U(0, \theta)$, $0 < \theta < \infty$, and let $T = \max(X_1, \ldots, X_n)$. Show that $T$ is sufficient,
once by using the definition of sufficiency and once by using the factorization
criterion and assuming the existence of statistics $Y_i$ satisfying (1.17)-(1.19).

(ii) Let $X_1, \ldots, X_n$ be a sample from the exponential distribution $E(a, b)$ with
density $(1/b)e^{-(x-a)/b}$ when $x \ge a$ ($-\infty < a < \infty$, $0 < b$). Use the factorization
criterion to prove that $(\min(X_1, \ldots, X_n), \sum_{i=1}^n X_i)$ is sufficient for $a, b$, assuming
the existence of statistics $Y_i$ satisfying (1.17)-(1.19).

Problem 1.19 A statistic T satisfying (1.17)-(1.19) is sufficient if and only if it 
satisfies (1.20). 


1.11 Notes 

Some of the basic concepts of statistical theory were initiated during the first 
quarter of the 19th century by Laplace in his fundamental Théorie Analytique
des Probabilités (1812), and by Gauss in his papers on the method of least squares.
Loss and risk functions are mentioned in their discussions of the problem of point 
estimation, for which Gauss also introduced the condition of unbiasedness. 

A period of intensive development of statistical methods began toward the end
of the century with the work of Karl Pearson. In particular, two areas were explored
in the researches of R. A. Fisher, J. Neyman, and many others: estimation
and the testing of hypotheses. The work of Fisher can be found in his books
(1925, 1935, 1956) and in the five volumes of his collected papers (1971-1973).
An interesting review of Fisher’s contributions is provided by Savage (1976), and
his life and work are recounted in the biography by his daughter Joan Fisher
Box (1978). Many of Neyman’s principal ideas are summarized in his Lectures
and Conferences (1938b). Collections of his early papers and of his joint papers
with E. S. Pearson have been published [Neyman (1967) and Neyman and Pearson
(1967)], and Constance Reid (1982) has written his biography. An influential
synthesis of the work of this period by Cramér appeared in 1946. Further concepts
were introduced in Lehmann (1950, 1951ab). More recent surveys of the modern
theories of estimation and testing are contained, for example, in the books by
Strasser (1985), Stuart and Ord (1991, 1999), Schervish (1995), Shao (1999) and
Bickel and Doksum (2001).

A formal unification of the theories of estimation and hypothesis testing, which 
also contains the possibility of many other specializations, was achieved by Wald 
in his general theory of decision procedures. An account of this theory, which 
is closely related to von Neumann’s theory of games, is found in Wald’s book 
(1950) and in those of Blackwell and Girshick (1954), Ferguson (1967), and Berger
(1985b).

2.1 Probability and Measure

The mathematical framework for statistical decision theory is provided by the 
theory of probability, which in turn has its foundations in the theory of measure 
and integration. The present chapter serves to define some of the basic concepts of 
these theories, to establish some notation, and to state without proof some of the 
principal results which will be used throughout Chapters 3-9. In the remainder 
of this chapter, certain special topics are treated in more detail. Basic notions of 
convergence in probability theory which will be needed for large sample statistical 
theory are deferred to Section 11.2. 

Probability theory is concerned with situations which may result in different
outcomes. The totality of these possible outcomes is represented abstractly by
the totality of points in a space $\mathcal{Z}$. Since the events to be studied are aggregates
of such outcomes, they are represented by subsets of $\mathcal{Z}$. The union of two sets
$C_1, C_2$ will be denoted by $C_1 \cup C_2$, their intersection by $C_1 \cap C_2$, the complement
of $C$ by $C^c = \mathcal{Z} - C$, and the empty set by $\emptyset$. The probability $P(C)$ of an event
$C$ is a real number between 0 and 1; in particular

$P(\emptyset) = 0$ and $P(\mathcal{Z}) = 1$. (2.1)

Probabilities have the property of countable additivity,

$P\left(\bigcup C_i\right) = \sum P(C_i)$ if $C_i \cap C_j = \emptyset$ for all $i \ne j$. (2.2)

Unfortunately it turns out that the set functions with which we shall be concerned
usually cannot be defined in a reasonable manner for all subsets of $\mathcal{Z}$
if they are to satisfy (2.2). It is, for example, not possible to give a reasonable
definition of “area” for all subsets of a unit square in the plane.

The sets for which the probability function $P$ will be defined are said to be
“measurable.” The domain of definition of $P$ should include with any set $C$ its
complement $C^c$, and with any countable number of events their union. By (2.1),
it should also include $\mathcal{Z}$. A class of sets that contains $\mathcal{Z}$ and is closed under
complementation and countable unions is a $\sigma$-field. Such a class is automatically
also closed under countable intersections.

The starting point of any probabilistic considerations is therefore a space $\mathcal{Z}$,
representing the possible outcomes, and a $\sigma$-field $\mathcal{C}$ of subsets of $\mathcal{Z}$, representing
the events whose probability is to be defined. Such a couple $(\mathcal{Z}, \mathcal{C})$ is called
a measurable space, and the elements of $\mathcal{C}$ constitute the measurable sets. A
countably additive nonnegative (not necessarily finite) set function $\mu$ defined
over $\mathcal{C}$ and such that $\mu(\emptyset) = 0$ is called a measure. If it assigns the value 1 to $\mathcal{Z}$,
it is a probability measure. More generally, $\mu$ is finite if $\mu(\mathcal{Z}) < \infty$ and $\sigma$-finite if
there exist $C_1, C_2, \ldots$ in $\mathcal{C}$ (which may always be taken to be mutually exclusive)
such that $\bigcup C_i = \mathcal{Z}$ and $\mu(C_i) < \infty$ for $i = 1, 2, \ldots$. Important special cases are
provided by the following examples.

Example 2.1.1 (Lebesgue measure) Let $\mathcal{Z}$ be the $n$-dimensional Euclidean
space $E_n$, and $\mathcal{C}$ the smallest $\sigma$-field containing all rectangles$^1$

$R = \{(z_1, \ldots, z_n) : a_i \le z_i \le b_i,\ i = 1, \ldots, n\}$.

The elements of $\mathcal{C}$ are called the Borel sets of $E_n$. Over $\mathcal{C}$ a unique measure $\mu$
can be defined, which to any rectangle $R$ assigns as its measure the volume of $R$,

$\mu(R) = \prod_{i=1}^n (b_i - a_i)$.

The measure $\mu$ can be completed by adjoining to $\mathcal{C}$ all subsets of sets of measure
zero. The domain of $\mu$ is thereby enlarged to a $\sigma$-field $\mathcal{C}'$, the class of Lebesgue-measurable
sets. The term Lebesgue measure is used for $\mu$ both when it is defined
over the Borel sets and when it is defined over the Lebesgue-measurable sets. ■

This example can be generalized to any nonnegative set function $\nu$, which is
defined and countably additive over the class of rectangles $R$. There exists then,
as before, a unique measure $\mu$ over $(\mathcal{Z}, \mathcal{C})$ that agrees with $\nu$ for all $R$. This
measure can again be completed; however, the resulting $\sigma$-field depends on $\mu$ and
need not agree with the $\sigma$-field $\mathcal{C}'$ obtained above.

Example 2.1.2 (Counting measure) Suppose that $\mathcal{Z}$ is countable, and let $\mathcal{C}$
be the class of all subsets of $\mathcal{Z}$. For any set $C$, define $\mu(C)$ as the number of
elements of $C$ if that number is finite, and otherwise as $+\infty$. This measure is
sometimes called counting measure. ■

In applications, the probabilities over $(\mathcal{Z}, \mathcal{C})$ refer to random experiments or
observations, the possible outcomes of which are the points $z \in \mathcal{Z}$. When recording
the results of an experiment, one is usually interested only in certain of its
aspects, typically some counts or measurements. These may be represented by a
function $T$ taking values in some space $\mathcal{T}$.

$^1$ If $\pi(z)$ is a statement concerning certain objects $z$, then $\{z : \pi(z)\}$ denotes the set
of all those $z$ for which $\pi(z)$ is true.

Such a function generates in $\mathcal{T}$ the $\sigma$-field $\mathcal{B}'$ of sets $B$ whose inverse image

$C = T^{-1}(B) = \{z : z \in \mathcal{Z},\ T(z) \in B\}$

is in $\mathcal{C}$, and for any given probability measure $P$ over $(\mathcal{Z}, \mathcal{C})$ a probability measure
$Q$ over $(\mathcal{T}, \mathcal{B}')$ defined by

$Q(B) = P(T^{-1}(B))$. (2.3)

Frequently, there is given a $\sigma$-field $\mathcal{B}$ of sets in $\mathcal{T}$ such that the probability
of $B$ should be defined if and only if $B \in \mathcal{B}$. This requires that $T^{-1}(B) \in \mathcal{C}$
for all $B \in \mathcal{B}$, and the function (or transformation) $T$ from $(\mathcal{Z}, \mathcal{C})$ into$^2$ $(\mathcal{T}, \mathcal{B})$ is
then said to be $\mathcal{C}$-measurable. Another implication is the sometimes convenient
restriction of probability statements to the sets $B \in \mathcal{B}$ even though there may
exist sets $B \notin \mathcal{B}$ for which $T^{-1}(B) \in \mathcal{C}$ and whose probability therefore could be
defined.
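
For a finite outcome space, where every subset is measurable, the induced measure (2.3) can be computed directly. The sketch below is only an illustration: the two-dice experiment and the names `induced_distribution` and `preimage` are assumptions introduced here, not part of the text.

```python
from fractions import Fraction
from collections import defaultdict

def induced_distribution(P, T):
    """Push a finite probability P on Z forward through T: Q({t}) = P(T^{-1}({t}))."""
    Q = defaultdict(Fraction)
    for z, prob in P.items():
        Q[T(z)] += prob
    return dict(Q)

def preimage(T, B, Z):
    """The inverse image T^{-1}(B) = {z in Z : T(z) in B}."""
    return {z for z in Z if T(z) in B}

# Two fair dice; T records only the sum of the faces.
Z = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {z: Fraction(1, 36) for z in Z}
T = lambda z: z[0] + z[1]
Q = induced_distribution(P, T)
```

Here `Q[7]` is the probability of the six outcomes whose faces sum to 7, and for any set `B` of sums the value of `Q` on `B` agrees with `P` on `preimage(T, B, Z)`.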

Of particular interest is the case of a single measurement in which the function
$T$ is real-valued. Let us denote it by $X$, and let $\mathcal{A}$ be the class of Borel sets
on the real line $\mathcal{X}$. Such a measurable real-valued $X$ is called a random variable,
and the probability measure it generates over $(\mathcal{X}, \mathcal{A})$ will be denoted by $P^X$ and
called the probability distribution of $X$. The value this measure assigns to a set
$A \in \mathcal{A}$ will be denoted interchangeably by $P^X(A)$ and $P(X \in A)$. Since the
intervals $\{x : x \le a\}$ are in $\mathcal{A}$, the probabilities $F(a) = P(X \le a)$ are defined for
all $a$. The function $F$, the cumulative distribution function (cdf) of $X$, is nondecreasing
and continuous on the right, and $F(-\infty) = 0$, $F(+\infty) = 1$. Conversely,
if $F$ is any function with these properties, a measure can be defined over the
intervals by $P\{a < X \le b\} = F(b) - F(a)$. It follows from Example 2.1.1 that
this measure uniquely determines a probability distribution over the Borel sets.
Thus the probability distribution $P^X$ and the cumulative distribution function $F$
uniquely determine each other. These remarks extend to probability distributions
over $n$-dimensional Euclidean space, where the cumulative distribution function
is defined by

$F(a_1, \ldots, a_n) = P\{X_1 \le a_1, \ldots, X_n \le a_n\}$.

In concrete problems, the space $(\mathcal{Z}, \mathcal{C})$, corresponding to the totality of possible
outcomes, is usually not specified and remains in the background. The real
starting point is the set $X$ of observations (typically vector-valued) that are being
recorded and which constitute the data, and the associated measurable space
$(\mathcal{X}, \mathcal{A})$, the sample space. Random variables or vectors that are measurable transformations
$T$ from $(\mathcal{X}, \mathcal{A})$ into some $(\mathcal{T}, \mathcal{B})$ are called statistics. The distribution
of $T$ is then given by (2.3) applied to all $B \in \mathcal{B}$. With this definition, a statistic
is specified by the function $T$ and the $\sigma$-field $\mathcal{B}$. We shall, however, adopt the
convention that when a function $T$ takes on its values in a Euclidean space, unless
otherwise stated the $\sigma$-field $\mathcal{B}$ of measurable sets will be taken to be the class of
Borel sets. It then becomes unnecessary to mention it explicitly or to indicate it
in the notation.

$^2$ The term into indicates that the range of $T$ is in $\mathcal{T}$; if $T(\mathcal{Z}) = \mathcal{T}$, the transformation
is said to be from $\mathcal{Z}$ onto $\mathcal{T}$.

The distinction between statistics and random variables as defined here is
slight. The term statistic is used to indicate that the quantity is a function of
more basic observations; all statistics in a given problem are functions defined
over the same sample space $(\mathcal{X}, \mathcal{A})$. On the other hand, any real-valued statistic
$T$ is a random variable, since it has a distribution over $(\mathcal{T}, \mathcal{B})$, and it will be
referred to as a random variable when its origin is irrelevant. Which term is used
therefore depends on the point of view and to some extent is arbitrary.


2.2 Integration 


According to the convention of the preceding section, a real-valued function $f$
defined over $(\mathcal{X}, \mathcal{A})$ is measurable if $f^{-1}(B) \in \mathcal{A}$ for every Borel set $B$ on the
real line. Such a function $f$ is said to be simple if it takes on only a finite number
of values. Let $\mu$ be a measure defined over $(\mathcal{X}, \mathcal{A})$, and let $f$ be a simple function
taking on the distinct values $a_1, \ldots, a_m$ on the sets $A_1, \ldots, A_m$, which are in $\mathcal{A}$,
since $f$ is measurable. If $\mu(A_i) < \infty$ when $a_i \ne 0$, the integral of $f$ with respect
to $\mu$ is defined by

$\int f\,d\mu = \sum a_i\,\mu(A_i)$. (2.4)

Given any nonnegative measurable function $f$, there exists a nondecreasing
sequence of simple functions $f_n$ converging to $f$. Then the integral of $f$ is defined
as

$\int f\,d\mu = \lim_{n \to \infty} \int f_n\,d\mu$, (2.5)

which can be shown to be independent of the particular sequence of $f_n$’s chosen.
For any measurable function $f$ its positive and negative parts

$f^+(x) = \max[f(x), 0]$ and $f^-(x) = \max[-f(x), 0]$ (2.6)

are also measurable, and

$f(x) = f^+(x) - f^-(x)$.

If the integrals of $f^+$ and $f^-$ are both finite, then $f$ is said to be integrable, and
its integral is defined as

$\int f^+\,d\mu - \int f^-\,d\mu$.

If of the two integrals one is finite and one infinite, then the integral of $f$ is
defined to be the appropriate infinite value; if both are infinite, the integral is
not defined.
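
The construction in (2.4) and (2.5) can be traced on a small example. The following sketch assumes a hypothetical measure that puts point masses on four points (so that every level set has a computable measure) and uses the standard dyadic approximation $f_n = \min(\lfloor 2^n f \rfloor / 2^n, n)$, which is one possible nondecreasing sequence of simple functions and is not prescribed by the text.

```python
import math
from collections import defaultdict

def dyadic_approx(f, n):
    """Simple function f_n = min(floor(2^n f) / 2^n, n); nondecreasing in n when f >= 0."""
    return lambda x: min(math.floor(2**n * f(x)) / 2**n, n)

def integral_simple(s, mu):
    """Integral of a simple function s as in (2.4): the sum of a_i * mu(A_i)."""
    level_sets = defaultdict(float)              # a_i -> mu(A_i), with A_i = {x : s(x) = a_i}
    for x, mass in mu.items():
        level_sets[s(x)] += mass
    return sum(a * m for a, m in level_sets.items())

# Hypothetical finite measure space: mu({x}) for four points x.
mu = {0.0: 0.5, 1.0: 1.0, 2.0: 0.25, 3.0: 2.0}
f = lambda x: math.sqrt(x) + 0.3
approximations = [integral_simple(dyadic_approx(f, n), mu) for n in range(1, 25)]
exact = sum(f(x) * m for x, m in mu.items())
```

The values in `approximations` increase with $n$ and approach `exact` from below, as (2.5) requires.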


Example 2.2.1 Let $\mathcal{X}$ be the closed interval $[a, b]$, $\mathcal{A}$ be the class of Borel sets or
of Lebesgue measurable sets in $\mathcal{X}$, and $\mu$ be Lebesgue measure. Then the integral
of $f$ with respect to $\mu$ is written as $\int_a^b f(x)\,dx$, and is called the Lebesgue integral
of $f$. This integral generalizes the Riemann integral in that it exists and agrees
with the Riemann integral of $f$ whenever the latter exists. ■

Example 2.2.2 Let $\mathcal{X}$ be countable and consist of the points $x_1, x_2, \ldots$; let $\mathcal{A}$
be the class of all subsets of $\mathcal{X}$, and let $\mu$ assign measure $b_i$ to the point $x_i$. Then
$f$ is integrable provided $\sum f(x_i)b_i$ converges absolutely, and $\int f\,d\mu$ is given by
this sum. ■

Let $P^X$ be the probability distribution of a random variable $X$, and let $T$ be a
real-valued statistic. If the function $T(x)$ is integrable, its expectation is defined
by

$E(T) = \int T(x)\,dP^X(x)$. (2.7)

It will be seen from Lemma 2.3.2 in Section 2.3 below that the integration can be
carried out alternatively in $t$-space with respect to the distribution of $T$ defined
by (2.3), so that also

$E(T) = \int t\,dP^T(t)$. (2.8)

The definition (2.5) of the integral permits the basic convergence theorems.


Theorem 2.2.1 Fatou’s Lemma Let $f_n$ be a sequence of measurable functions
such that $f_n(x) \ge 0$ and $f_n(x) \to f(x)$, except possibly on a set of $x$ values having
$\mu$ measure 0. Then,

$\int f\,d\mu \le \liminf \int f_n\,d\mu$.

Theorem 2.2.2 Let $f_n$ be a sequence of measurable functions, and let $f_n(x) \to
f(x)$, except possibly on a set of $x$ values having $\mu$ measure 0. Then

$\int f_n\,d\mu \to \int f\,d\mu$

if any one of the following conditions holds:

(i) Lebesgue Monotone Convergence Theorem: the $f_n$’s are nonnegative
and the sequence is nondecreasing;

or

(ii) Lebesgue Dominated Convergence Theorem: there exists an
integrable function $g$ such that $|f_n(x)| \le g(x)$ for all $n$ and $x$;

or

(iii) General Form: there exist $g_n$ and $g$ with $|f_n| \le g_n$, $g_n(x) \to g(x)$
except possibly on a $\mu$ null set, and $\int g_n\,d\mu \to \int g\,d\mu$.


Corollary 2.2.1 Vitali’s Theorem Suppose $f_n$ and $f$ are real-valued measurable
functions with $f_n(x) \to f(x)$, except possibly on a set having $\mu$ measure 0.
Assume

$\limsup_n \int f_n^2(x)\,d\mu(x) \le \int f^2(x)\,d\mu(x) < \infty$.

Then,

$\int |f_n(x) - f(x)|^2\,d\mu(x) \to 0$.

For a proof of this result, see Theorem 6.1.3 of Hájek, Šidák, and Sen (1999).

For any set $A \in \mathcal{A}$, let $I_A$ be its indicator function defined by

$I_A(x) = 1$ or 0 as $x \in A$ or $x \in A^c$, (2.9)

and let

$\int_A f\,d\mu = \int f I_A\,d\mu$. (2.10)


If $\mu$ is a measure and $f$ a nonnegative measurable function over $(\mathcal{X}, \mathcal{A})$, then

$\nu(A) = \int_A f\,d\mu$ (2.11)

defines a new measure over $(\mathcal{X}, \mathcal{A})$. The fact that (2.11) holds for all $A \in \mathcal{A}$ is
expressed by writing

$d\nu = f\,d\mu$ or $f = \frac{d\nu}{d\mu}$. (2.12)

Let $\mu$ and $\nu$ be two given $\sigma$-finite measures over $(\mathcal{X}, \mathcal{A})$. If there exists a function
$f$ satisfying (2.12), it is determined through this relation up to sets of measure
zero, since

$\int_A f\,d\mu = \int_A g\,d\mu$ for all $A \in \mathcal{A}$

implies that $f = g$ a.e. $\mu$.$^3$ Such an $f$ is called the Radon-Nikodym derivative of
$\nu$ with respect to $\mu$, and in the particular case that $\nu$ is a probability measure,
the probability density of $\nu$ with respect to $\mu$.

The question of existence of a function $f$ satisfying (2.12) for given measures $\mu$
and $\nu$ is answered in terms of the following definition. A measure $\nu$ is absolutely
continuous with respect to $\mu$ if

$\mu(A) = 0$ implies $\nu(A) = 0$.

Theorem 2.2.3 (Radon-Nikodym) If $\mu$ and $\nu$ are $\sigma$-finite measures over
$(\mathcal{X}, \mathcal{A})$, then there exists a measurable function $f$ satisfying (2.12) if and only
if $\nu$ is absolutely continuous with respect to $\mu$.
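
On a finite space with all subsets measurable, the theorem reduces to a pointwise division, which the following sketch illustrates. The dictionaries of point masses and the names `density` and `integrate` are assumptions made for the example; where $\mu$ vanishes the density is set to 0, which is one admissible version since $f$ is determined only a.e. $\mu$.

```python
from fractions import Fraction
from itertools import combinations

def density(nu, mu):
    """A version of f = d(nu)/d(mu) on a finite space given by point masses."""
    points = set(mu) | set(nu)
    if any(mu.get(x, 0) == 0 and nu.get(x, 0) != 0 for x in points):
        raise ValueError("nu is not absolutely continuous with respect to mu")
    return {
        x: Fraction(nu.get(x, 0)) / mu[x] if mu.get(x, 0) != 0 else Fraction(0)
        for x in points
    }

def integrate(f, mu, A):
    """Integral of f over the set A with respect to mu, as in (2.11)."""
    return sum(f[x] * mu.get(x, 0) for x in A)

mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(0), "d": Fraction(1, 4)}
nu = {"a": Fraction(1, 8), "b": Fraction(3, 4), "c": Fraction(0), "d": Fraction(1, 8)}
f = density(nu, mu)

# Check (2.11) on every subset A of the space.
subsets = [set(c) for r in range(5) for c in combinations(sorted(mu), r)]
holds_everywhere = all(integrate(f, mu, A) == sum(nu[x] for x in A) for A in subsets)
```

If `nu` placed positive mass on the point `"c"`, where `mu` vanishes, absolute continuity would fail and `density` would raise an error, in line with the "only if" part of the theorem.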

The direct (or Cartesian) product $A \times B$ of two sets $A$ and $B$ is the set of all
pairs $(x, y)$ with $x \in A$, $y \in B$. Let $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{B})$ be two measurable spaces,
and let $\mathcal{A} \times \mathcal{B}$ be the smallest $\sigma$-field containing all sets $A \times B$ with $A \in \mathcal{A}$ and
$B \in \mathcal{B}$. If $\mu$ and $\nu$ are two $\sigma$-finite measures over $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{B})$ respectively,
then there exists a unique measure $\lambda = \mu \times \nu$ over $(\mathcal{X} \times \mathcal{Y}, \mathcal{A} \times \mathcal{B})$, the product
of $\mu$ and $\nu$, such that for any $A \in \mathcal{A}$, $B \in \mathcal{B}$,

$\lambda(A \times B) = \mu(A)\nu(B)$. (2.13)

$^3$ A statement that holds for all points $x$ except possibly on a set of $\mu$-measure zero
is said to hold almost everywhere $\mu$, abbreviated a.e. $\mu$; or to hold a.e. $(\mathcal{A}, \mu)$ if it is
desirable to indicate the $\sigma$-field over which $\mu$ is defined.

Example 2.2.3 Let $\mathcal{X}, \mathcal{Y}$ be Euclidean spaces of $m$ and $n$ dimensions, and let
$\mathcal{A}, \mathcal{B}$ be the $\sigma$-fields of Borel sets in these spaces. Then $\mathcal{X} \times \mathcal{Y}$ is an $(m + n)$-dimensional
Euclidean space, and $\mathcal{A} \times \mathcal{B}$ the class of its Borel sets. ■

Example 2.2.4 Let $Z = (X, Y)$ be a random variable defined over $(\mathcal{X} \times \mathcal{Y}, \mathcal{A} \times
\mathcal{B})$, and suppose that the random variables $X$ and $Y$ have distributions $P^X$, $P^Y$
over $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{B})$. Then $X$ and $Y$ are said to be independent if the
probability distribution $P^Z$ of $Z$ is the product $P^X \times P^Y$. ■

In terms of these concepts the reduction of a double integral to a repeated one 
is given by the following theorem. 


Theorem 2.2.4 (Fubini) Let $\mu$ and $\nu$ be $\sigma$-finite measures over $(\mathcal{X}, \mathcal{A})$ and
$(\mathcal{Y}, \mathcal{B})$ respectively, and let $\lambda = \mu \times \nu$. If $f(x, y)$ is integrable with respect to $\lambda$,
then

(i) for almost all ($\nu$) fixed $y$, the function $f(x, y)$ is integrable with respect to $\mu$,

(ii) the function $\int f(x, y)\,d\mu(x)$ is integrable with respect to $\nu$, and

$\int f(x, y)\,d\lambda(x, y) = \int \left[ \int f(x, y)\,d\mu(x) \right] d\nu(y)$. (2.14)
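
For measures given by finitely many point masses, (2.13) and (2.14) can be verified by direct summation. This is a minimal sketch under that finiteness assumption; the function names and the particular masses are illustrative.

```python
from fractions import Fraction

def product_measure(mu, nu):
    """lambda = mu x nu on a finite product space, specified on points via (2.13)."""
    return {(x, y): mu[x] * nu[y] for x in mu for y in nu}

def double_integral(f, lam):
    """Integral of f(x, y) with respect to the product measure lam."""
    return sum(f(x, y) * mass for (x, y), mass in lam.items())

def iterated_integral(f, mu, nu):
    """Right-hand side of (2.14): integrate over x for fixed y, then over y."""
    inner = lambda y: sum(f(x, y) * mu[x] for x in mu)
    return sum(inner(y) * nu[y] for y in nu)

mu = {0: Fraction(1, 2), 1: Fraction(3, 2), 2: Fraction(1)}
nu = {-1: Fraction(2), 1: Fraction(1, 3)}
f = lambda x, y: x * x * y + 2 * x - y
lam = product_measure(mu, nu)
```

With these values `double_integral(f, lam)` and `iterated_integral(f, mu, nu)` return the same rational number; in the finite case the statement amounts to reordering a finite sum.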


2.3 Statistics and Subfields 

According to the definition of Section 2.1, a statistic is a measurable transformation
$T$ from the sample space $(\mathcal{X}, \mathcal{A})$ into a measurable space $(\mathcal{T}, \mathcal{B})$. Such a
transformation induces in the original sample space the subfield$^4$

$\mathcal{A}_0 = T^{-1}(\mathcal{B}) = \{T^{-1}(B) : B \in \mathcal{B}\}$. (2.15)

Since the set $T^{-1}[T(A)]$ contains $A$ but is not necessarily equal to $A$, the $\sigma$-field
$\mathcal{A}_0$ need not coincide with $\mathcal{A}$ and hence can be a proper subfield of $\mathcal{A}$. On the other
hand, suppose for a moment that $\mathcal{T} = T(\mathcal{X})$, that is, that the transformation $T$
is onto rather than into $\mathcal{T}$. Then

$T[T^{-1}(B)] = B$ for all $B \in \mathcal{B}$, (2.16)

so that the relationship $A_0 = T^{-1}(B)$ establishes a 1:1 correspondence between
the sets of $\mathcal{A}_0$ and $\mathcal{B}$, which is an isomorphism, that is, which preserves the set
operations of intersection, union, and complementation. For most purposes it is
therefore immaterial whether one works in the space $(\mathcal{X}, \mathcal{A}_0)$ or in $(\mathcal{T}, \mathcal{B})$. These
generate two equivalent classes of events, and therefore of measurable functions,
possible decision procedures, etc. If the transformation $T$ is only into $\mathcal{T}$, the above
1:1 correspondence applies to the class $\mathcal{B}'$ of subsets of $\mathcal{T}' = T(\mathcal{X})$ which belong
to $\mathcal{B}$, rather than to $\mathcal{B}$ itself. However, any set $B \in \mathcal{B}$ is equivalent to $B' = B \cap \mathcal{T}'$
in the sense that any measure over $(\mathcal{X}, \mathcal{A})$ assigns the same measure to $B'$ as to
$B$. Considered as classes of events, $\mathcal{A}_0$ and $\mathcal{B}$ therefore continue to be equivalent,
with the only difference that $\mathcal{B}$ contains several (equivalent) representations of
the same event.

$^4$ We shall use this term in place of the more cumbersome “sub-$\sigma$-field.”

As an example, let $\mathcal{X}$ be the real line and $\mathcal{A}$ the class of Borel sets, and let
$T(x) = x^2$. Let $\mathcal{T}$ be either the positive real axis or the whole real axis, and let
$\mathcal{B}$ be the class of Borel subsets of $\mathcal{T}$. Then $\mathcal{A}_0$ is the class of Borel sets that are
symmetric with respect to the origin. When considering, for example, real-valued
measurable functions, one would, when working in $\mathcal{T}$-space, restrict attention
to measurable functions of $x^2$. Instead, one could remain in the original space,
where the restriction would be to the class of even measurable functions of $x$.
The equivalence is clear. Which representation is more convenient depends on
the situation.

That the correspondence between the sets $A_0 = T^{-1}(B) \in \mathcal{A}_0$ and $B \in \mathcal{B}$
establishes an analogous correspondence between measurable functions defined
over $(\mathcal{X}, \mathcal{A}_0)$ and $(\mathcal{T}, \mathcal{B})$ is shown by the following lemma.


Lemma 2.3.1 Let the statistic $T$ from $(\mathcal{X}, \mathcal{A})$ into $(\mathcal{T}, \mathcal{B})$ induce the subfield $\mathcal{A}_0$.
Then a real-valued $\mathcal{A}$-measurable function $f$ is $\mathcal{A}_0$-measurable if and only if there
exists a $\mathcal{B}$-measurable function $g$ such that

$f(x) = g[T(x)]$

for all $x$.

Proof. Suppose first that such a function $g$ exists. Then the set

$\{x : f(x) < r\} = T^{-1}(\{t : g(t) < r\})$

is in $\mathcal{A}_0$, and $f$ is $\mathcal{A}_0$-measurable. Conversely, if $f$ is $\mathcal{A}_0$-measurable, then the sets

$A_{in} = \left\{x : \frac{i}{2^n} < f(x) \le \frac{i+1}{2^n}\right\}$, $i = 0, \pm 1, \pm 2, \ldots$,

are (for fixed $n$) disjoint sets in $\mathcal{A}_0$ whose union is $\mathcal{X}$, and there exist $B_{in} \in \mathcal{B}$
such that $A_{in} = T^{-1}(B_{in})$. Let

$B^*_{in} = B_{in} \cap \left(\bigcup_{j \ne i} B_{jn}\right)^c$.

Since $A_{in}$ and $A_{jn}$ are mutually exclusive for $i \ne j$, the set $T^{-1}(B_{in} \cap B_{jn})$ is
empty and so is the set $T^{-1}(B_{in} \cap \{B^*_{in}\}^c)$. Hence, for fixed $n$, the sets $B^*_{in}$ are
disjoint, and still satisfy $A_{in} = T^{-1}(B^*_{in})$. Defining

$f_n(x) = \frac{i}{2^n}$ if $x \in A_{in}$, $i = 0, \pm 1, \pm 2, \ldots$,

one can write

$f_n(x) = g_n[T(x)]$,

where

$g_n(t) = \frac{i}{2^n}$ for $t \in B^*_{in}$, $i = 0, \pm 1, \pm 2, \ldots$, and $g_n(t) = 0$ otherwise.

Since the functions $g_n$ are $\mathcal{B}$-measurable, the set $B$ on which $g_n(t)$ converges to
a finite limit is in $\mathcal{B}$. Let $R = T(\mathcal{X})$ be the range of $T$. Then for $t \in R$,

$\lim g_n[T(x)] = \lim f_n(x) = f(x)$

for all $x \in \mathcal{X}$ so that $R$ is contained in $B$. Therefore, the function $g$ defined
by $g(t) = \lim g_n(t)$ for $t \in B$ and $g(t) = 0$ otherwise possesses the required
properties. ■

The relationship between integrals of the functions $f$ and $g$ above is given by
the following lemma.

Lemma 2.3.2 Let $T$ be a measurable transformation from $(\mathcal{X}, \mathcal{A})$ into $(\mathcal{T}, \mathcal{B})$, $\mu$
a $\sigma$-finite measure over $(\mathcal{X}, \mathcal{A})$, and $g$ a real-valued measurable function of $t$. If
$\mu^*$ is the measure defined over $(\mathcal{T}, \mathcal{B})$ by

$$\mu^*(B) = \mu[T^{-1}(B)] \quad \text{for all } B \in \mathcal{B}, \tag{2.17}$$

then for any $B \in \mathcal{B}$,

$$\int_{T^{-1}(B)} g[T(x)]\, d\mu(x) = \int_B g(t)\, d\mu^*(t) \tag{2.18}$$

in the sense that if either integral exists, so does the other and the two are equal.

Proof. Without loss of generality let $B$ be the whole space $\mathcal{T}$. If $g$ is the indicator
of a set $B_0 \in \mathcal{B}$, the lemma holds, since the left- and right-hand sides of (2.18)
reduce respectively to $\mu[T^{-1}(B_0)]$ and $\mu^*(B_0)$, which are equal by the definition
of $\mu^*$. It follows that (2.18) holds successively for all simple functions, for all
nonnegative measurable functions, and hence finally for all integrable functions.
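
For a measure with finitely many atoms, both sides of (2.18) are finite sums and the identity can be checked directly. The following is a minimal sketch under that assumption; the atoms and weights of `mu`, the map `T(x) = x*x`, and the function `g` are hypothetical choices made only for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical finite measure mu on five points of the real line.
mu = {-2: Fraction(1, 10), -1: Fraction(2, 10), 0: Fraction(3, 10),
      1: Fraction(1, 10), 2: Fraction(3, 10)}


def T(x):
    return x * x


def g(t):
    return 3 * t + 1  # any real-valued function of t


def pushforward(mu, T):
    """Induced measure mu*(B) = mu(T^{-1}(B)), stored by atoms of the range."""
    mu_star = defaultdict(Fraction)
    for x, mass in mu.items():
        mu_star[T(x)] += mass
    return dict(mu_star)


def lhs(B):
    """Integral of g(T(x)) d mu(x) over T^{-1}(B)."""
    return sum(g(T(x)) * m for x, m in mu.items() if T(x) in B)


def rhs(B):
    """Integral of g(t) d mu*(t) over B."""
    return sum(g(t) * m for t, m in pushforward(mu, T).items() if t in B)
```

With these choices `lhs(B)` and `rhs(B)` agree for every subset `B` of the range `{0, 1, 4}`, since both sums group the same terms, once by `x` and once by `t = T(x)`.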


2.4 Conditional Expectation and Probability 

If two statistics induce the same subfield $\mathcal{A}_0$, they are equivalent in the sense of
leading to equivalent classes of measurable events. This equivalence is particularly
relevant to considerations of conditional probability. Thus if $X$ is normally
distributed with zero mean, the information carried by the statistics $|X|$, $X^2$,
$e^{-X^2}$, and so on, is the same. Given that $|X| = t$, $X^2 = t^2$, $e^{-X^2} = e^{-t^2}$, it
follows that $X$ is $\pm t$, and any reasonable definition of conditional probability will
assign probability $\frac{1}{2}$ to each of these values. The general definition of conditional
probability to be given below will in fact involve essentially only $\mathcal{A}_0$ and not the
range space $\mathcal{T}$ of $T$. However, when referred to $\mathcal{A}_0$ alone the concept loses much
of its intuitive meaning, and the gap between the elementary definition and that
of the general case becomes unnecessarily wide. For these reasons it is frequently
more convenient to work with a particular representation of a statistic, involving
a definite range space $(\mathcal{T}, \mathcal{B})$.
Let $P$ be a probability measure over $(\mathcal{X}, \mathcal{A})$, $T$ a statistic with range space
$(\mathcal{T}, \mathcal{B})$, and $\mathcal{A}_0$ the subfield it induces. Consider a nonnegative function $f$ which is
integrable $(\mathcal{A}, P)$, that is, $\mathcal{A}$-measurable and $P$-integrable. Then $\int_A f\, dP$ is defined
for all $A \in \mathcal{A}$ and therefore for all $A_0 \in \mathcal{A}_0$. It follows from the Radon-Nikodym
theorem (Theorem 2.2.3) that there exists a function $f_0$ which is integrable
$(\mathcal{A}_0, P)$ and such that

$$\int_{A_0} f\, dP = \int_{A_0} f_0\, dP \quad \text{for all } A_0 \in \mathcal{A}_0, \tag{2.19}$$

and that $f_0$ is unique $(\mathcal{A}_0, P)$. By Lemma 2.3.1, $f_0$ depends on $x$ only through
$T(x)$. In the example of a normally distributed variable $X$ with zero mean, and
$T = X^2$, the function $f_0$ is determined by (2.19) holding for all sets $A_0$ that are
symmetric with respect to the origin, so that $f_0(x) = \frac{1}{2}[f(x) + f(-x)]$.
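
The symmetrization $f_0(x) = \frac{1}{2}[f(x) + f(-x)]$ can be verified against (2.19) exactly when the normal distribution is replaced by a discrete distribution that is symmetric about the origin. The sketch below makes that substitution; the distribution `P` and the function `f` are hypothetical and serve only to illustrate the defining equation.

```python
from fractions import Fraction

# Hypothetical distribution on {-2, ..., 2}, symmetric about 0 (a discrete
# stand-in for the zero-mean normal distribution of the text).
P = {-2: Fraction(1, 8), -1: Fraction(1, 4), 0: Fraction(1, 4),
     1: Fraction(1, 4), 2: Fraction(1, 8)}


def f(x):
    return x ** 3 + x ** 2 + 2 * x + 5  # arbitrary, not an even function


def f0(x):
    """Candidate conditional expectation of f(X) given T = X^2."""
    return Fraction(f(x) + f(-x), 2)


def integral(func, A0):
    """Integral of func over the set A0 with respect to P."""
    return sum(func(x) * P[x] for x in A0)


# Sets induced by T(x) = x^2 are exactly the sets symmetric about the origin.
symmetric_sets = [{0}, {-1, 1}, {-2, 2}, {-1, 0, 1}, {-2, -1, 1, 2}, set(P)]
checks = [integral(f, A0) == integral(f0, A0) for A0 in symmetric_sets]
```

Every entry of `checks` is `True`, and `f0` is even, so it depends on `x` only through `x**2`.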

The function $f_0$ defined through (2.19) is determined by two properties:


(i) Its average value over any set $A_0$ with respect to $P$ is the same as that of $f$;


(ii) It depends on $x$ only through $T(x)$ and hence is constant on the sets $D_x$ over
which $T$ is constant.


Intuitively, what one attempts to do in order to construct such a function is
to define $f_0(x)$ as the conditional $P$-average of $f$ over the set $D_x$. One would
thereby replace the single averaging process of integrating $f$ represented by the
left-hand side with a two-stage averaging process such as an iterated integral.
Such a construction can actually be carried out when $X$ is a discrete variable
and in the regular case considered in Section 1.9; $f_0(x)$ is then just the conditional
expectation of $f(X)$ given $T(x)$. In general, it is not clear how to define
this conditional expectation directly. Since it should, however, possess properties
(i) and (ii), and since these through (2.19) determine $f_0$ uniquely $(\mathcal{A}_0, P)$, we
shall take $f_0(x)$ of (2.19) as the general definition of the conditional expectation
$E[f(X) \mid T(x)]$. Equivalently, if $f_0(x) = g[T(x)]$, one can write


$$E[f(X) \mid t] = E[f(X) \mid T = t] = g(t),$$

so that $E[f(X) \mid t]$ is a $\mathcal{B}$-measurable function defined up to equivalence $(\mathcal{B}, P^T)$.
In the relationship of integrals given in Lemma 2.3.2, if $\mu = P^X$, then $\mu^* = P^T$,
and it is seen that the function $g$ can be defined directly in terms of $f$ through

$$\int_{T^{-1}(B)} f(x)\, dP^X(x) = \int_B g(t)\, dP^T(t) \quad \text{for all } B \in \mathcal{B}, \tag{2.20}$$

which is equivalent to (2.19).

So far, $f$ has been assumed to be nonnegative. In the general case, the
conditional expectation of $f$ is defined as

$$E[f(X) \mid t] = E[f^+(X) \mid t] - E[f^-(X) \mid t].$$


Example 2.4.1 (Order statistics) Let $X_1, \ldots, X_n$ be identically and independently
distributed random variables with continuous distribution function,
and let

$$T(x_1, \ldots, x_n) = (x_{(1)}, \ldots, x_{(n)})$$

where $x_{(1)} \le \cdots \le x_{(n)}$ denote the ordered $x$'s. Without loss of generality one
can restrict attention to the points with $x_{(1)} < \cdots < x_{(n)}$, since the probability
of two coordinates being equal is 0. Then $\mathcal{X}$ is the set of all $n$-tuples with distinct
coordinates, $\mathcal{T}$ the set of all ordered $n$-tuples, and $\mathcal{A}$ and $\mathcal{B}$ are the classes of
Borel subsets of $\mathcal{X}$ and $\mathcal{T}$. Under $T^{-1}$ the set consisting of the single point $a =
(a_1, \ldots, a_n)$ is transformed into the set consisting of the $n!$ points $(a_{i_1}, \ldots, a_{i_n})$
that are obtained from $a$ by permuting the coordinates in all possible ways. It
follows that $\mathcal{A}_0$ is the class of all sets that are symmetric in the sense that if $A_0$
contains a point $x = (x_1, \ldots, x_n)$, then it also contains all points $(x_{i_1}, \ldots, x_{i_n})$.

For any integrable function $f$, let

$$f_0(x) = \frac{1}{n!} \sum f(x_{i_1}, \ldots, x_{i_n}),$$

where the summation extends over the $n!$ permutations of $(x_1, \ldots, x_n)$. Then $f_0$
is $\mathcal{A}_0$-measurable, since it is symmetric in its $n$ arguments. Also

$$\int_{A_0} f(x_1, \ldots, x_n)\, dP(x_1) \cdots dP(x_n) = \int_{A_0} f(x_{i_1}, \ldots, x_{i_n})\, dP(x_1) \cdots dP(x_n),$$

so that $f_0$ satisfies (2.19). It follows that $f_0(x)$ is the conditional expectation of
$f(X)$ given $T(x)$.

The conditional expectation of $f(X)$ given the above statistic $T(x)$ can also be
found without assuming the $X$'s to be identically and independently distributed.
Suppose that $X$ has a density $h(x)$ with respect to a measure $\mu$ (such as Lebesgue
measure), which is symmetric in the variables $x_1, \ldots, x_n$ in the sense that for any
$A \in \mathcal{A}$ it assigns to the set $\{x : (x_{i_1}, \ldots, x_{i_n}) \in A\}$ the same measure for all
permutations $(i_1, \ldots, i_n)$. Let

$$f_0(x_1, \ldots, x_n) = \frac{\sum f(x_{i_1}, \ldots, x_{i_n})\, h(x_{i_1}, \ldots, x_{i_n})}{\sum h(x_{i_1}, \ldots, x_{i_n})};$$

here and in the sums below the summation extends over the $n!$ permutations
of $(x_1, \ldots, x_n)$. The function $f_0$ is symmetric in its $n$ arguments and hence $\mathcal{A}_0$-measurable.
For any symmetric set $A_0$, the integral

$$\int_{A_0} f_0(x_1, \ldots, x_n)\, h(x_{j_1}, \ldots, x_{j_n})\, d\mu(x_1, \ldots, x_n)$$

has the same value for each permutation $(x_{j_1}, \ldots, x_{j_n})$, and therefore

$$\int_{A_0} f_0(x_1, \ldots, x_n)\, h(x_1, \ldots, x_n)\, d\mu(x_1, \ldots, x_n)$$
$$= \int_{A_0} f_0(x_1, \ldots, x_n)\, \frac{1}{n!} \sum h(x_{i_1}, \ldots, x_{i_n})\, d\mu(x_1, \ldots, x_n)$$
$$= \int_{A_0} f(x_1, \ldots, x_n)\, h(x_1, \ldots, x_n)\, d\mu(x_1, \ldots, x_n).$$

It follows that $f_0(x) = E[f(X) \mid T(x)]$.
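
The weighted permutation average can be checked exactly in a finite setting. The sketch below assumes counting measure on the triples with distinct coordinates drawn from a four-point set (a symmetric measure in the sense used above); the density `h` and the function `f` are hypothetical, and `h` is left unnormalized because a constant factor cancels from both sides of (2.19).

```python
from fractions import Fraction
from itertools import permutations

values = (1, 2, 3, 4)
points = list(permutations(values, 3))  # all triples with distinct coordinates


def h(x):
    """Hypothetical unnormalized, non-symmetric density w.r.t. counting measure."""
    return Fraction(1 + x[0] + 2 * x[1] * x[2])


def f(x):
    return x[0] - 3 * x[1] + x[2] ** 2


def f0(x):
    """Ratio of sums over the n! permutations of the coordinates of x."""
    perms = list(permutations(x))
    return sum(f(p) * h(p) for p in perms) / sum(h(p) for p in perms)


def orbit(x):
    """The symmetric set generated by x: all permutations of its coordinates."""
    return set(permutations(x))


def integral(func, A0):
    """Integral of func * h over A0 with respect to counting measure."""
    return sum(func(x) * h(x) for x in A0)
```

For any symmetric set, such as `orbit((1, 2, 3))` or the whole of `points`, `integral(f, A0)` equals `integral(f0, A0)`, because `f0` is constant on each orbit and equal there to the `h`-weighted mean of `f`.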

Equivalent to the statistic $T(x) = (x_{(1)}, \ldots, x_{(n)})$, the set of order statistics, is
$U(x) = (\sum x_i, \sum x_i^2, \ldots, \sum x_i^n)$. This is an immediate consequence of the fact,
to be shown below, that if $T(x^0) = t^0$ and $U(x^0) = u^0$, then

$$T^{-1}(\{t^0\}) = U^{-1}(\{u^0\}) = S$$

where $\{t^0\}$ and $\{u^0\}$ denote the sets consisting of the single point $t^0$ and $u^0$ respectively,
and where $S$ consists of the totality of points $x = (x_1, \ldots, x_n)$ obtained
by permuting the coordinates of $x^0 = (x_1^0, \ldots, x_n^0)$ in all possible ways.

That $T^{-1}(\{t^0\}) = S$ is obvious. To see the corresponding fact for $U^{-1}$, let

$$V(x) = \Big(\sum_i x_i,\ \sum_{i<j} x_i x_j,\ \sum_{i<j<k} x_i x_j x_k,\ \ldots,\ x_1 x_2 \cdots x_n\Big),$$

so that the components of $V(x)$ are the elementary symmetric functions $v_1 =
\sum x_i, \ldots, v_n = x_1 \cdots x_n$ of the $n$ arguments $x_1, \ldots, x_n$. Then

$$(x - x_1) \cdots (x - x_n) = x^n - v_1 x^{n-1} + v_2 x^{n-2} - \cdots + (-1)^n v_n.$$

Hence $V(x^0) = v^0 = (v_1^0, \ldots, v_n^0)$ implies that $V^{-1}(\{v^0\}) = S$. That then also
$U^{-1}(\{u^0\}) = S$ follows from the 1:1 correspondence between $u$ and $v$ established
by the relations (known as Newton's identities):$^5$

$$u_k - v_1 u_{k-1} + v_2 u_{k-2} - \cdots + (-1)^{k-1} v_{k-1} u_1 + (-1)^k k v_k = 0$$

for $1 \le k \le n$. ■
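
Newton's identities, and the fact that they let one recover $v$ from $u$, can be checked numerically. The following sketch uses hypothetical helper names (`power_sums`, `elementary_symmetric`, `newton_residual`, `v_from_u`); it evaluates the left-hand side of the $k$th identity and solves the identities recursively for $v_1, \ldots, v_n$.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def power_sums(x):
    """u_k = sum of x_i^k for k = 1, ..., n."""
    return [sum(xi ** k for xi in x) for k in range(1, len(x) + 1)]


def elementary_symmetric(x):
    """v_k = sum of all products of k distinct coordinates, k = 1, ..., n."""
    return [sum(prod(c) for c in combinations(x, k))
            for k in range(1, len(x) + 1)]


def newton_residual(u, v, k):
    """Left-hand side of the k-th identity (1 <= k <= n); it should vanish."""
    total = u[k - 1]
    for j in range(1, k):
        total += (-1) ** j * v[j - 1] * u[k - j - 1]
    return total + (-1) ** k * k * v[k - 1]


def v_from_u(u):
    """Solve the identities successively for v_1, ..., v_n given u_1, ..., u_n."""
    v = []
    for k in range(1, len(u) + 1):
        s = u[k - 1] + sum((-1) ** j * v[j - 1] * u[k - j - 1]
                           for j in range(1, k))
        # s + (-1)^k k v_k = 0, hence v_k = (-1)^(k+1) s / k
        v.append(Fraction((-1) ** (k + 1) * s, k))
    return v
```

The recursion in `v_from_u` makes the direction $u \to v$ of the correspondence explicit: each $v_k$ is determined by $u_1, \ldots, u_k$ and the previously computed $v_1, \ldots, v_{k-1}$.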

It is easily verified from the above definition that conditional expectation possesses
most of the usual properties of expectation. It follows of course from the
nonuniqueness of the definition that these properties can hold only $(\mathcal{B}, P^T)$. We
state this formally in the following lemma.

Lemma 2.4.1 If $T$ is a statistic and the functions $f, g, \ldots$ are integrable $(\mathcal{A}, P)$,
then a.e. $(\mathcal{B}, P^T)$

(i) $E[af(X) + bg(X) \mid t] = aE[f(X) \mid t] + bE[g(X) \mid t]$;

(ii) $E[h(T)f(X) \mid t] = h(t)E[f(X) \mid t]$;

(iii) $a \le f(x) \le b$ $(\mathcal{A}, P)$ implies $a \le E[f(X) \mid t] \le b$;

(iv) $|f_n| \le g$, $f_n(x) \to f(x)$ $(\mathcal{A}, P)$ implies $E[f_n(X) \mid t] \to E[f(X) \mid t]$.

A further useful result is obtained by specializing (2.20) to the case that $B$ is
the whole space $\mathcal{T}$. One then has

Lemma 2.4.2 If $E|f(X)| < \infty$, and if $g(t) = E[f(X) \mid t]$, then

$$E[f(X)] = E[g(T)]; \tag{2.21}$$

that is, the expectation can be obtained as the expected value of the conditional
expectation.

Since $P\{X \in A\} = E[I_A(X)]$, where $I_A$ denotes the indicator of the set $A$, it
is natural to define the conditional probability of $A$ given $T = t$ by

$$P(A \mid t) = E[I_A(X) \mid t]. \tag{2.22}$$


$^5$ For a proof of these relations see for example Turnbull (1952), Section 32.
In view of (2.20) the defining equation for $P(A \mid t)$ can therefore be written as

$$P^X(A \cap T^{-1}(B)) = \int_{A \cap T^{-1}(B)} dP^X(x) = \int_B P(A \mid t)\, dP^T(t) \tag{2.23}$$

for all $B \in \mathcal{B}$.


It is an immediate consequence of Lemma 2.4.1 that subject to the appropriate
null-set$^6$ qualifications, $P(A \mid t)$ possesses the usual properties of probabilities,
as summarized in the following lemma.


Lemma 2.4.3 If $T$ is a statistic with range space $(\mathcal{T}, \mathcal{B})$, and $A, B, A_1, A_2, \ldots$
are sets belonging to $\mathcal{A}$, then a.e. $(\mathcal{B}, P^T)$

(i) $0 \le P(A \mid t) \le 1$;

(ii) if the sets $A_1, A_2, \ldots$ are mutually exclusive,

$$P\Big(\bigcup A_i \,\Big|\, t\Big) = \sum P(A_i \mid t);$$

(iii) $A \subset B$ implies $P(A \mid t) \le P(B \mid t)$.

According to the definition (2.22), the conditional probability $P(A \mid t)$ must
be considered for fixed $A$ as a $\mathcal{B}$-measurable function of $t$. This is in contrast to
the elementary definition in which one takes $t$ as fixed and considers $P(A \mid t)$
for varying $A$ as a set function over $\mathcal{A}$. Lemma 2.4.3 suggests the possibility that
the interpretation of $P(A \mid t)$ for fixed $t$ as a probability distribution over $\mathcal{A}$
may be valid also in the general case. However, the equality $P(A_1 \cup A_2 \mid t) =
P(A_1 \mid t) + P(A_2 \mid t)$, for example, can break down on a null set that may vary
with $A_1$ and $A_2$, and the union of all these null sets need no longer have measure
zero.

For an important class of cases, this difficulty can be overcome through the
nonuniqueness of the functions $P(A \mid t)$, which for each fixed $A$ are determined
only up to sets of measure zero in $t$. Since all determinations of these functions
are equivalent, it is enough to find a specific determination for each $A$ so that for
each fixed $t$ these determinations jointly constitute a probability distribution over
$\mathcal{A}$. This possibility is illustrated by Example 2.4.1, in which the conditional probability
distribution given $T(x) = t$ can be taken to assign probability $1/n!$ to each
of the $n!$ points satisfying $T(x) = t$. Sufficient conditions for the existence of such
conditional distributions will be given in the next section. For counterexamples
see Blackwell and Dubins (1975).


$^6$ This term is used as an alternative to the more cumbersome “set of measure zero.”
2.5 Conditional Probability Distributions$^7$

We shall now investigate the existence of conditional probability distributions
under the assumption, satisfied in most statistical applications, that $\mathcal{X}$ is a Borel
set in a Euclidean space. We shall then say for short that $\mathcal{X}$ is Euclidean and
assume that, unless otherwise stated, $\mathcal{A}$ is the class of Borel subsets of $\mathcal{X}$.

Theorem 2.5.1 If $\mathcal{X}$ is Euclidean, there exist determinations of the functions
$P(A \mid t)$ such that for each $t$, $P(A \mid t)$ is a probability measure over $\mathcal{A}$.

Proof. By setting equal to 0 the probability of any Borel set in the complement
of $\mathcal{X}$, one can extend the given probability measure to the class of all Borel sets
and can therefore assume without loss of generality that $\mathcal{X}$ is the full Euclidean
space. For simplicity we shall give the proof only in the one-dimensional case.
For each real $x$ put $F(x, t) = P((-\infty, x] \mid t)$ for some version of this conditional
probability function, and let $r_1, r_2, \ldots$ denote the set of all rational numbers in
some order. Then $r_i < r_j$ implies that $F(r_i, t) \le F(r_j, t)$ for all $t$ except those in a
null set $N_{ij}$, and hence that $F(x, t)$ is nondecreasing in $x$ over the rationals for all $t$
outside of the null set $N' = \bigcup N_{ij}$. Similarly, it follows from Lemma 2.4.1(iv) that
for all $t$ not in a null set $N''$, as $n$ tends to infinity $\lim F(r_i + 1/n, t) = F(r_i, t)$ for
$i = 1, 2, \ldots$, $\lim F(n, t) = 1$, and $\lim F(-n, t) = 0$. Therefore, for all $t$ outside of
the null set $N' \cup N''$, $F(x, t)$ considered as a function of $x$ is properly normalized,
monotone, and continuous on the right over the rationals. For $t$ not in $N' \cup N''$
let $F^*(x, t)$ be the unique function that is continuous on the right in $x$ and agrees
with $F(x, t)$ for all rational $x$. Then $F^*(x, t)$ is a cumulative distribution function
and therefore determines a probability measure $P^*(A \mid t)$ over $\mathcal{A}$. We shall now
show that $P^*(A \mid t)$ is a conditional probability of $A$ given $t$, by showing that
for each fixed $A$ it is a $\mathcal{B}$-measurable function of $t$ satisfying (2.23). This will be
accomplished by proving that for each fixed $A \in \mathcal{A}$

$$P^*(A \mid t) = P(A \mid t) \quad (\mathcal{B}, P^T).$$

By definition of $P^*$ this is true whenever $A$ is one of the sets $(-\infty, x]$ with $x$
rational. It holds next when $A$ is an interval $(a, b] = (-\infty, b] - (-\infty, a]$ with
$a, b$ rational, since $P^*$ is a measure and $P$ satisfies Lemma 2.4.3(ii). Therefore,
the desired equation holds for the field $\mathcal{F}$ of all sets $A$ which are finite unions
of intervals $(a_i, b_i]$ with rational end points. Finally, the class of sets for which
the equation holds is a monotone class (see Problem 2.1) and hence contains the
smallest $\sigma$-field containing $\mathcal{F}$, which is $\mathcal{A}$. The measure $P^*(A \mid t)$ over $\mathcal{A}$ was
defined above for all $t$ not in $N' \cup N''$. However, since neither the measurability
of a function nor the values of its integrals are affected by its values on a null set,
one can take arbitrary probability measures over $\mathcal{A}$ for $t$ in $N' \cup N''$ and thereby
complete the determination. ■

If $X$ is a vector-valued random variable with probability distribution $P^X$ and
$T$ is a statistic defined over $(\mathcal{X}, \mathcal{A})$, let $P^{X|t}$ denote any version of the family
of conditional distributions $P(A \mid t)$ over $\mathcal{A}$ guaranteed by Theorem 2.5.1. The
connection with conditional expectation is given by the following theorem.


$^7$ This section may be omitted at first reading. Its principal application is in the proof
of Lemma 2.7.2(ii) in Section 2.7, which in turn is used only in the proof of Theorem
4.4.1.


Theorem 2.5.2 If $X$ is a vector-valued random variable and $E|f(X)| < \infty$,
then

$$E[f(X) \mid t] = \int f(x)\, dP^{X|t}(x) \quad (\mathcal{B}, P^T). \tag{2.24}$$

Proof. Equation (2.24) holds if $f$ is the indicator of any set $A \in \mathcal{A}$. It then
follows from Lemma 2.4.1 that it also holds for any simple function and hence
for any integrable function. ■

The determination of the conditional expectation $E[f(X) \mid t]$ given by the
right-hand side of (2.24) possesses for each $t$ the usual properties of an expectation,
(i), (iii), and (iv) of Lemma 2.4.1, which previously could be asserted only
up to sets of measure zero depending on the functions $f, g, \ldots$ involved. Under
the assumptions of Theorem 2.5.1 a similar strengthening is possible with respect
to (ii) of Lemma 2.4.1, which can be shown to hold except possibly on a null set
$N$ not depending on the function $h$. It will be sufficient for the present purpose to
prove this under the additional assumption that the range space of the statistic $T$
is also Euclidean. For a proof without this restriction see for example Billingsley
(1995).


Theorem 2.5.3 If $T$ is a statistic with Euclidean domain and range spaces
$(\mathcal{X}, \mathcal{A})$ and $(\mathcal{T}, \mathcal{B})$, there exists a determination $P^{X|t}$ of the conditional probability
distribution and a null set $N$ such that the conditional expectation computed
by

$$E[f(X) \mid t] = \int f(x)\, dP^{X|t}(x)$$

satisfies for all $t \notin N$

$$E[h(T)f(X) \mid t] = h(t)E[f(X) \mid t]. \tag{2.25}$$


Proof. For the sake of simplicity and without essential loss of generality suppose
that $T$ is real-valued. Let $P^{X|t}(A)$ be a probability distribution over $\mathcal{A}$ for each $t$,
the existence of which is guaranteed by Theorem 2.5.1. For $B \in \mathcal{B}$, the indicator
function $I_B(t)$ is $\mathcal{B}$-measurable and

$$\int_{B'} I_B(t)\, dP^T(t) = P^T(B' \cap B) = P^X(T^{-1}B' \cap T^{-1}B) \quad \text{for all } B' \in \mathcal{B}.$$

Thus by (2.20)

$$I_B(t) = P^{X|t}(T^{-1}B) \quad \text{a.e. } P^T.$$

Let $B_n$, $n = 1, 2, \ldots$, be the intervals of $\mathcal{T}$ with rational end points. Then there
exists a $P^T$-null set $N = \bigcup N_n$ such that for $t \notin N$

$$I_{B_n}(t) = P^{X|t}(T^{-1}B_n)$$

for all $n$. For fixed $t \notin N$, the two set functions $P^{X|t}(T^{-1}B)$ and $I_B(t)$ are
probability distributions over $\mathcal{B}$, the latter assigning probability 1 or 0 to a set as
it does or does not contain the point $t$. Since these distributions agree over the
rational intervals $B_n$, they agree for all $B \in \mathcal{B}$. In particular, for $t \notin N$, the set
consisting of the single point $t$ is in $\mathcal{B}$, and if

$$A^{(t)} = \{x : T(x) = t\},$$

it follows that for all $t \notin N$

$$P^{X|t}(A^{(t)}) = 1. \tag{2.26}$$

Thus

$$\int h[T(x)] f(x)\, dP^{X|t}(x) = \int_{A^{(t)}} h[T(x)] f(x)\, dP^{X|t}(x) = h(t) \int f(x)\, dP^{X|t}(x)$$

for $t \notin N$, as was to be proved. ■

It is a consequence of Theorem 2.5.3 that for all $t \notin N$, $E[h(T) \mid t] = h(t)$ and
hence in particular $P(T \in B \mid t) = 1$ or 0 as $t \in B$ or $t \notin B$.

The conditional distributions $P^{X|t}$ still differ from those of the elementary case
considered in Section 1.9, in being defined over $(\mathcal{X}, \mathcal{A})$ rather than over the set
$A^{(t)}$ and the $\sigma$-field $\mathcal{A}^{(t)}$ of its Borel subsets. However, (2.26) implies that for
$t \notin N$

$$P^{X|t}(A) = P^{X|t}(A \cap A^{(t)}).$$

The calculations of conditional probabilities and expectations are therefore unchanged
if for $t \notin N$, $P^{X|t}$ is replaced by the distribution $\bar{P}^{X|t}$, which is defined
over $(A^{(t)}, \mathcal{A}^{(t)})$ and which assigns to any subset of $A^{(t)}$ the same probability as
$P^{X|t}$.

Theorem 2.5.3 establishes for all $t \notin N$ the existence of conditional probability
distributions $\bar{P}^{X|t}$ which are defined over $(A^{(t)}, \mathcal{A}^{(t)})$ and which by Lemma 2.4.2
satisfy

$$E[f(X)] = \int_{\mathcal{T} - N} \left[\int_{A^{(t)}} f(x)\, d\bar{P}^{X|t}(x)\right] dP^T(t) \tag{2.27}$$

for all integrable functions $f$. Conversely, consider any family of distributions
satisfying (2.27), and the experiment of observing first $T$, and then, if $T = t$, a
random quantity with distribution $\bar{P}^{X|t}$. The result of this two-stage procedure
is a point distributed over $(\mathcal{X}, \mathcal{A})$ with the same distribution as the original $X$.
Thus $\bar{P}^{X|t}$ satisfies this “functional” definition of conditional probability.

If $(\mathcal{X}, \mathcal{A})$ is a product space $(\mathcal{T} \times \mathcal{Y}, \mathcal{B} \times \mathcal{C})$, then $A^{(t)}$ is the product of $\mathcal{Y}$ with the
set consisting of the single point $t$. For $t \notin N$, the conditional distribution $\bar{P}^{X|t}$
then induces a distribution over $(\mathcal{Y}, \mathcal{C})$, which in analogy with the elementary
case will be denoted by $P^{Y|t}$. In this case the definition can be extended to all
of $\mathcal{T}$ by letting $P^{Y|t}$ assign probability 1 to a common specified point $y_0$ for all
$t \in N$. With this definition, (2.27) becomes

$$E f(T, Y) = \int_{\mathcal{T}} \left[\int_{\mathcal{Y}} f(t, y)\, dP^{Y|t}(y)\right] dP^T(t). \tag{2.28}$$
As an application, we shall prove the following lemma, which will be used in
Section 2.7.


Lemma 2.5.1 Let $(\mathcal{T}, \mathcal{B})$ and $(\mathcal{Y}, \mathcal{C})$ be Euclidean spaces, and let $P_0$ be a
distribution over the product space $(\mathcal{X}, \mathcal{A}) = (\mathcal{T} \times \mathcal{Y}, \mathcal{B} \times \mathcal{C})$. Suppose that another
distribution $P_1$ over $(\mathcal{X}, \mathcal{A})$ is such that

$$dP_1(t, y) = a(y)\, b(t)\, dP_0(t, y),$$

with $a(y) > 0$ for all $y$. Then under $P_1$ the marginal distribution of $T$ and a
version of the conditional distribution of $Y$ given $t$ are given by

$$dP_1^T(t) = b(t) \left[\int_{\mathcal{Y}} a(y)\, dP_0^{Y|t}(y)\right] dP_0^T(t)$$

and

$$dP_1^{Y|t}(y) = \frac{a(y)\, dP_0^{Y|t}(y)}{\int_{\mathcal{Y}} a(y')\, dP_0^{Y|t}(y')}.$$

Proof. The first statement of the lemma follows from the equation

$$P_1\{T \in B\} = E_1[I_B(T)] = E_0[I_B(T)\, a(Y)\, b(T)] = \int_B b(t) \left[\int_{\mathcal{Y}} a(y)\, dP_0^{Y|t}(y)\right] dP_0^T(t).$$

To check the second statement, one need only show that for any integrable $f$ the
expectation $E_1 f(Y, T)$ satisfies (2.28), which is immediate. The denominator of
$dP_1^{Y|t}$ is positive, since $a(y) > 0$ for all $y$. ■
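
On a finite product space the two formulas of Lemma 2.5.1 reduce to sums and can be compared with the marginal and conditional distributions computed directly from $P_1$. The sketch below assumes hypothetical values for $P_0$, $a$, and $b$; the factor `b` is rescaled so that $P_1$ has total mass 1.

```python
from fractions import Fraction

ts, ys = (0, 1, 2), (0, 1)

# Hypothetical joint distribution P0 on the product space (weights sum to 12).
weights = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 1, (2, 0): 2, (2, 1): 3}
P0 = {k: Fraction(w, 12) for k, w in weights.items()}

a = {0: Fraction(1), 1: Fraction(3)}                         # a(y) > 0
b_raw = {0: Fraction(2), 1: Fraction(1), 2: Fraction(1, 2)}
norm = sum(a[y] * b_raw[t] * P0[t, y] for t, y in P0)
b = {t: val / norm for t, val in b_raw.items()}              # makes P1 a distribution

P1 = {(t, y): a[y] * b[t] * P0[t, y] for t, y in P0}         # dP1 = a(y) b(t) dP0


def marginal_T(P):
    return {t: sum(P[t, y] for y in ys) for t in ts}


def conditional_Y(P, t):
    m = sum(P[t, y] for y in ys)
    return {y: P[t, y] / m for y in ys}


P0_T = marginal_T(P0)


def lemma_marginal(t):
    """Marginal of T under P1 according to the first formula of the lemma."""
    c0 = conditional_Y(P0, t)
    return b[t] * sum(a[y] * c0[y] for y in ys) * P0_T[t]


def lemma_conditional(t):
    """Conditional distribution of Y given t under P1, second formula."""
    c0 = conditional_Y(P0, t)
    denom = sum(a[y] * c0[y] for y in ys)
    return {y: a[y] * c0[y] / denom for y in ys}
```

Here `marginal_T(P1)[t]` coincides with `lemma_marginal(t)` and `conditional_Y(P1, t)` with `lemma_conditional(t)` for every `t`; note that `b` drops out of the conditional distribution, as the lemma asserts.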


2.6 Characterization of Sufficiency 

We can now generalize the definition of sufficiency given in Section 1.9. If $\mathcal{P} =
\{P_\theta, \theta \in \Omega\}$ is any family of distributions defined over a common sample space
$(\mathcal{X}, \mathcal{A})$, a statistic $T$ is sufficient for $\mathcal{P}$ (or for $\theta$) if for each $A$ in $\mathcal{A}$ there exists a
determination of the conditional probability function $P_\theta(A \mid t)$ that is independent
of $\theta$. As an example suppose that $X_1, \ldots, X_n$ are identically and independently
distributed with continuous distribution function $F_\theta$, $\theta \in \Omega$. Then it follows from
Example 2.4.1 that the set of order statistics $T(X) = (X_{(1)}, \ldots, X_{(n)})$ is sufficient
for $\theta$.

Theorem 2.6.1 If $\mathcal{X}$ is Euclidean, and if the statistic $T$ is sufficient for $\mathcal{P}$, then
there exist determinations of the conditional probability distributions $P_\theta(A \mid t)$
which are independent of $\theta$ and such that for each fixed $t$, $P_\theta(A \mid t)$ is a probability
measure over $\mathcal{A}$.

Proof. This is seen from the proof of Theorem 2.5.1. By the definition of sufficiency
one can, for each rational number $r$, take the functions $F(r, t)$ to be
independent of $\theta$, and the resulting conditional distributions will then also not
depend on $\theta$. ■
In Chapter 1 the definition of sufficiency was justified by showing that in a
certain sense a sufficient statistic contains all the available information. In view
of Theorem 2.6.1 the same justification applies quite generally when the sample
space is Euclidean. With the help of a random mechanism one can then construct
from a sufficient statistic $T$ a random vector $X'$ having the same distribution as
the original sample vector $X$. Another generalization of the earlier result, not
involving the restriction to a Euclidean sample space, is given in Problem 2.13.

The factorization criterion of sufficiency, derived in Chapter 1, can be extended
to any dominated family of distributions, that is, any family $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$
possessing probability densities $p_\theta$ with respect to some $\sigma$-finite measure $\mu$ over
$(\mathcal{X}, \mathcal{A})$. The proof of this statement is based on the existence of a probability
distribution $\lambda = \sum c_i P_{\theta_i}$ (Theorem 2.2.3 of the Appendix), which is equivalent
to $\mathcal{P}$ in the sense that for any $A \in \mathcal{A}$

$$\lambda(A) = 0 \quad \text{if and only if} \quad P_\theta(A) = 0 \text{ for all } \theta \in \Omega. \tag{2.29}$$


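Theorem 2.6.2 Let $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ be a dominated family of probability distributions
over $(\mathcal{X}, \mathcal{A})$, and let $\lambda = \sum c_i P_{\theta_i}$ satisfy (2.29). Then a statistic $T$
with range space $(\mathcal{T}, \mathcal{B})$ is sufficient for $\mathcal{P}$ if and only if there exist nonnegative
$\mathcal{B}$-measurable functions $g_\theta(t)$ such that

$$dP_\theta(x) = g_\theta[T(x)]\, d\lambda(x) \tag{2.30}$$

for all $\theta \in \Omega$.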


Proof. Let $\mathcal{A}_0$ be the subfield induced by $T$, and suppose that $T$ is sufficient for
$\theta$. Then for all $\theta \in \Omega$, $A_0 \in \mathcal{A}_0$, and $A \in \mathcal{A}$

$$\int_{A_0} P(A \mid T(x))\, dP_\theta(x) = P_\theta(A \cap A_0);$$

and since $\lambda = \sum c_i P_{\theta_i}$,

$$\int_{A_0} P(A \mid T(x))\, d\lambda(x) = \lambda(A \cap A_0),$$

so that $P(A \mid T(x))$ serves as conditional probability function also for $\lambda$. Let
$g_\theta(T(x))$ be the Radon-Nikodym derivative $dP_\theta(x)/d\lambda(x)$ for $(\mathcal{A}_0, \lambda)$. To prove
(2.30) it is necessary to show that $g_\theta(T(x))$ is also the derivative of $P_\theta$ for $(\mathcal{A}, \lambda)$.
If $A_0$ is put equal to $\mathcal{X}$ in the first displayed equation, this follows from the
relation

$$P_\theta(A) = \int P(A \mid T(x))\, dP_\theta(x) = \int E_\lambda[I_A(x) \mid T(x)]\, dP_\theta(x)$$
$$= \int E_\lambda[I_A(x) \mid T(x)]\, g_\theta(T(x))\, d\lambda(x)$$
$$= \int E_\lambda[g_\theta(T(x)) I_A(x) \mid T(x)]\, d\lambda(x)$$
$$= \int g_\theta(T(x)) I_A(x)\, d\lambda(x) = \int_A g_\theta(T(x))\, d\lambda(x).$$

Here the second equality uses the fact, established at the beginning of the proof,
that $P(A \mid T(x))$ is also the conditional probability for $\lambda$; the third equality holds
because the function being integrated is $\mathcal{A}_0$-measurable and because $dP_\theta = g_\theta\, d\lambda$
for $(\mathcal{A}_0, \lambda)$; the fourth is an application of Lemma 2.4.1(ii); and the fifth employs
the defining property of conditional expectation.

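Suppose conversely that (2.30) holds. We shall then prove that the conditional
probability function $P_\lambda(A \mid t)$ serves as a conditional probability function for
all $P \in \mathcal{P}$. Let $g_\theta(T(x)) = dP_\theta(x)/d\lambda(x)$ on $\mathcal{A}$ and for fixed $A$ and $\theta$ define a
measure $\nu$ over $\mathcal{A}$ by the equation $d\nu = I_A\, dP_\theta$. Then over $\mathcal{A}_0$, $d\nu(x)/dP_\theta(x) =
E_\theta[I_A(X) \mid T(x)]$, and therefore

$$\frac{d\nu(x)}{d\lambda(x)} = P_\theta[A \mid T(x)]\, g_\theta(T(x)) \quad \text{over } \mathcal{A}_0.$$

On the other hand, $d\nu(x)/d\lambda(x) = I_A(x) g_\theta(T(x))$ over $\mathcal{A}$, and hence

$$\frac{d\nu(x)}{d\lambda(x)} = E_\lambda[I_A(X) g_\theta(T(X)) \mid T(x)] = P_\lambda[A \mid T(x)]\, g_\theta(T(x)) \quad \text{over } \mathcal{A}_0.$$

It follows that $P_\lambda(A \mid T(x)) g_\theta(T(x)) = P_\theta(A \mid T(x)) g_\theta(T(x))$ $(\mathcal{A}_0, \lambda)$ and hence
$(\mathcal{A}_0, P_\theta)$. Since $g_\theta(T(x)) \ne 0$ $(\mathcal{A}_0, P_\theta)$, this shows that $P_\theta(A \mid T(x)) = P_\lambda(A \mid
T(x))$ $(\mathcal{A}_0, P_\theta)$, and hence that $P_\lambda(A \mid T(x))$ is a determination of $P_\theta(A \mid T(x))$.

■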

Instead of the above formulation, which explicitly involves the distribution
$\lambda$, it is sometimes more convenient to state the result with respect to a given
dominating measure $\mu$.

Corollary 2.6.1 (Factorization theorem) If the distributions $P_\theta$ of $\mathcal{P}$ have
probability densities $p_\theta = dP_\theta/d\mu$ with respect to a $\sigma$-finite measure $\mu$, then $T$ is
sufficient for $\mathcal{P}$ if and only if there exist nonnegative $\mathcal{B}$-measurable functions $g_\theta$
on $\mathcal{T}$ and a nonnegative $\mathcal{A}$-measurable function $h$ on $\mathcal{X}$ such that

$$p_\theta(x) = g_\theta[T(x)]\, h(x) \quad (\mathcal{A}, \mu). \tag{2.31}$$

Proof. Let $\lambda = \sum c_i P_{\theta_i}$ satisfy (2.29). Then if $T$ is sufficient, (2.31) follows from
(2.30) with $h = d\lambda/d\mu$. Conversely, if (2.31) holds,

$$d\lambda(x) = \sum c_i\, g_{\theta_i}[T(x)]\, h(x)\, d\mu(x) = k[T(x)]\, h(x)\, d\mu(x)$$

and therefore $dP_\theta(x) = g^*_\theta(T(x))\, d\lambda(x)$ where $g^*_\theta(t) = g_\theta(t)/k(t)$ when $k(t) > 0$
and may be defined arbitrarily when $k(t) = 0$. ■

For extensions of the factorization theorem to undominated families, see
Ghosh, Morimoto, and Yamada (1981) and the literature cited there.
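
As a small illustration of the factorization criterion, consider $n$ Bernoulli trials with success probability $p$, for which the density with respect to counting measure factors as $g_p[T(x)]\,h(x)$ with $T(x) = \sum x_i$ and $h \equiv 1$. The sketch below (the Bernoulli model, the choice $n = 4$, and the function names are assumptions made for this illustration) computes the conditional distribution of the sample given $T = t$ and shows that it does not involve $p$.

```python
from fractions import Fraction
from itertools import product
from math import comb

n = 4  # hypothetical sample size


def density(x, p):
    """Joint density of n Bernoulli(p) trials: g_p(T(x)) * h(x) with h = 1."""
    t = sum(x)
    return p ** t * (1 - p) ** (n - t)


def conditional_given_T(t, p):
    """Conditional distribution of the sample given T = sum(x) = t."""
    pts = [x for x in product((0, 1), repeat=n) if sum(x) == t]
    total = sum(density(x, p) for x in pts)
    return {x: density(x, p) / total for x in pts}
```

For any two values of `p` strictly between 0 and 1, `conditional_given_T(t, p)` returns the same distribution, namely the uniform distribution over the `comb(n, t)` arrangements with `t` successes.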


2.7 Exponential Families 

An important family of distributions which admits a reduction by means of sufficient
statistics is the exponential family, defined by probability densities of the
form

$$p_\theta(x) = C(\theta) \exp\left[\sum_{j=1}^{k} Q_j(\theta) T_j(x)\right] h(x) \tag{2.32}$$

with respect to a $\sigma$-finite measure $\mu$ over a Euclidean sample space $(\mathcal{X}, \mathcal{A})$. Particular
cases are the distributions of a sample $X = (X_1, \ldots, X_n)$ from a binomial,
Poisson, or normal distribution. In the binomial case, for example, the density
(with respect to counting measure) is

$$\binom{n}{x} p^x (1-p)^{n-x} = (1-p)^n \exp\left[x \log\left(\frac{p}{1-p}\right)\right] \binom{n}{x}.$$
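
The binomial density can be matched term by term with the form (2.32), taking $C(\theta) = (1-p)^n$, $Q(\theta) = \log[p/(1-p)]$, $T(x) = x$, and $h(x) = \binom{n}{x}$. A minimal numerical sketch of this identification follows; the function names are ours and the comparison is in floating point.

```python
from math import comb, exp, log


def binom_pmf(x, n, p):
    """Binomial density in its usual form."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_expfam(x, n, p):
    """The same density written as C(theta) * exp[Q(theta) T(x)] * h(x)."""
    C = (1 - p) ** n          # normalizing factor C(theta)
    Q = log(p / (1 - p))      # Q(theta): the log odds
    h = comb(n, x)            # h(x), free of the parameter
    return C * exp(Q * x) * h  # T(x) = x
```

The two functions agree up to rounding error for every `x` between 0 and `n` and every `p` strictly between 0 and 1.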




Example 2.7.1 If $Y_1, \ldots, Y_n$ are independently distributed, each with density
(with respect to Lebesgue measure)

$$p_\sigma(y) = \frac{y^{(f/2)-1} \exp\left[-y/(2\sigma^2)\right]}{(2\sigma^2)^{f/2}\, \Gamma(f/2)}, \qquad y > 0, \tag{2.33}$$

then the joint distribution of the $Y$'s constitutes an exponential family. For $\sigma = 1$,
(2.33) is the density of the $\chi^2$-distribution with $f$ degrees of freedom; in particular
for $f$ an integer this is the density of $\sum_{j=1}^{f} X_j^2$, where the $X$'s are a sample from
the normal distribution $N(0, 1)$. ■


Example 2.7.2 Consider $n$ independent trials, each of them resulting in one of
the $s$ outcomes $E_1, \ldots, E_s$ with probabilities $p_1, \ldots, p_s$ respectively. If $X_{ij}$ is 1
when the outcome of the $i$th trial is $E_j$ and 0 otherwise, the joint distribution of
the $X$'s is

$$P\{X_{11} = x_{11}, \ldots, X_{ns} = x_{ns}\} = p_1^{\sum x_{i1}}\, p_2^{\sum x_{i2}} \cdots p_s^{\sum x_{is}},$$

where all $x_{ij} = 0$ or 1 and $\sum_j x_{ij} = 1$. This forms an exponential family with
$T_j(x) = \sum_{i=1}^{n} x_{ij}$ $(j = 1, \ldots, s - 1)$. The joint distribution of the $T$'s is the
multinomial distribution $M(n; p_1, \ldots, p_s)$ given by

$$P\{T_1 = t_1, \ldots, T_{s-1} = t_{s-1}\} = \frac{n!}{t_1! \cdots t_{s-1}!\,(n - t_1 - \cdots - t_{s-1})!}$$
$$\times\ p_1^{t_1} \cdots p_{s-1}^{t_{s-1}} (1 - p_1 - \cdots - p_{s-1})^{n - t_1 - \cdots - t_{s-1}}. \tag{2.34}$$

■
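
Formula (2.34) can be confirmed by brute force for small $n$ and $s$: enumerate every sequence of trial outcomes, accumulate its probability under the value of $(T_1, \ldots, T_{s-1})$ it produces, and compare with the closed form. The values $n = 3$, $s = 3$, and the probabilities `p` below are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import factorial

n, s = 3, 3
p = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))  # hypothetical p_1, ..., p_s


def multinomial_pmf(t):
    """Closed form (2.34); t = (t_1, ..., t_{s-1})."""
    counts = tuple(t) + (n - sum(t),)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)  # exact: each partial quotient is an integer
    prob = Fraction(coef)
    for pj, c in zip(p, counts):
        prob *= pj ** c
    return prob


def enumerate_T():
    """Distribution of (T_1, ..., T_{s-1}) obtained by listing all s^n sequences."""
    dist = defaultdict(Fraction)
    for outcome in product(range(s), repeat=n):
        prob = Fraction(1)
        for j in outcome:
            prob *= p[j]
        t = tuple(outcome.count(j) for j in range(s - 1))
        dist[t] += prob
    return dict(dist)
```

Each key `t` of `enumerate_T()` carries exactly the probability `multinomial_pmf(t)`, and the probabilities sum to 1.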

If $X_1, \ldots, X_n$ is a sample from a distribution with density (2.32), the joint
distribution of the $X$'s constitutes an exponential family with the sufficient
statistics $\sum_{i=1}^{n} T_j(X_i)$, $j = 1, \ldots, k$. Thus there exists a $k$-dimensional sufficient
statistic for $(X_1, \ldots, X_n)$ regardless of the sample size. Suppose conversely that
$X_1, \ldots, X_n$ is a sample from a distribution with some density $p_\theta(x)$ and that the
set over which this density is positive is independent of $\theta$. Then under regularity
assumptions which make the concept of dimensionality meaningful, if there exists
a $k$-dimensional sufficient statistic with $k < n$, the densities $p_\theta(x)$ constitute an
exponential family. For proof of this result, see Darmois (1935), Koopman (1936)
and Pitman (1937). Regularity conditions of the result are discussed in Barankin
and Maitra (1963), Brown (1964), Barndorff-Nielsen and Pedersen (1968), and
Hipp (1974).
Employing a more natural parametrization and absorbing the factor $h(x)$ into
$\mu$, we shall write an exponential family in the form $dP_\theta(x) = p_\theta(x)\, d\mu(x)$ with

$$p_\theta(x) = C(\theta) \exp\left[\sum_{j=1}^{k} \theta_j T_j(x)\right]. \tag{2.35}$$

For suitable choice of the constant $C(\theta)$, the right-hand side of (2.35) is a probability
density provided its integral is finite. The set $\Omega$ of parameter points
$\theta = (\theta_1, \ldots, \theta_k)$ for which this is the case is the natural parameter space of the
exponential family (2.35).

Optimum tests of certain hypotheses concerning any $\theta_j$ are obtained in Chapter
4. We shall now consider some properties of exponential families required for this
purpose.

Lemma 2.7.1 The natural parameter space of an exponential family is convex.

Proof. Let $(\theta_1, \ldots, \theta_k)$ and $(\theta'_1, \ldots, \theta'_k)$ be two parameter points for which the
integral of (2.35) is finite. Then by Hölder's inequality,

$$\int \exp\left[\sum \left[\alpha\theta_j + (1 - \alpha)\theta'_j\right] T_j(x)\right] d\mu(x)$$
$$\le \left[\int \exp\left[\sum \theta_j T_j(x)\right] d\mu(x)\right]^{\alpha} \left[\int \exp\left[\sum \theta'_j T_j(x)\right] d\mu(x)\right]^{1-\alpha} < \infty$$

for any $0 < \alpha < 1$. ■

If the convex set $\Omega$ lies in a linear space of dimension $< k$, then (2.35) can be
rewritten in a form involving fewer than $k$ components of $T$. We shall therefore,
without loss of generality, assume $\Omega$ to be $k$-dimensional.

It follows from the factorization theorem that $T(x) = (T_1(x), \ldots, T_k(x))$ is
sufficient for $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$.
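
The Hölder bound used in the proof of Lemma 2.7.1 can be observed numerically. The sketch below assumes a hypothetical finite measure `mu` on six points with $T(x) = (x, x^2)$, so that the integral of the exponential is finite for every $\theta$; it returns the gap between the two sides of the inequality, which should be nonnegative up to rounding.

```python
from math import exp

xs = range(6)
mu = {x: 1.0 + 0.5 * x for x in xs}  # hypothetical finite measure on 0, ..., 5


def T(x):
    return (x, x * x)


def K(theta):
    """Integral of exp[sum_j theta_j T_j(x)] with respect to mu."""
    return sum(exp(sum(th * tj for th, tj in zip(theta, T(x)))) * mu[x]
               for x in xs)


def holder_gap(theta, theta2, alpha):
    """K(theta)^alpha * K(theta2)^(1 - alpha) - K(alpha*theta + (1 - alpha)*theta2)."""
    mix = tuple(alpha * a + (1 - alpha) * b for a, b in zip(theta, theta2))
    return K(theta) ** alpha * K(theta2) ** (1 - alpha) - K(mix)
```

Equivalently, $\log K$ is a convex function of $\theta$, which is the analytic content of the convexity of the natural parameter space when the measure is not finite.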


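Lemma 2.7.2 Let $X$ be distributed according to the exponential family

$$dP^X_{\theta,\vartheta}(x) = C(\theta, \vartheta) \exp\left[\sum_{i=1}^{r} \theta_i U_i(x) + \sum_{j=1}^{s} \vartheta_j T_j(x)\right] d\mu(x).$$

Then there exist measures $\lambda_\theta$ and $\nu_t$ over $s$- and $r$-dimensional Euclidean space
respectively such that

(i) the distribution of $T = (T_1, \ldots, T_s)$ is an exponential family of the form

$$dP^T_{\theta,\vartheta}(t) = C(\theta, \vartheta) \exp\left(\sum_{j=1}^{s} \vartheta_j t_j\right) d\lambda_\theta(t), \tag{2.36}$$

(ii) the conditional distribution of $U = (U_1, \ldots, U_r)$ given $T = t$ is an exponential
family of the form

$$dP^{U|t}_{\theta}(u) = C_t(\theta) \exp\left(\sum_{i=1}^{r} \theta_i u_i\right) d\nu_t(u), \tag{2.37}$$

and hence in particular is independent of $\vartheta$.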
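Proof. Let $(\theta^0, \vartheta^0)$ be a point of the natural parameter space, and let $\mu^* =
P^X_{\theta^0,\vartheta^0}$. Then

$$dP^X_{\theta,\vartheta}(x) = \frac{C(\theta, \vartheta)}{C(\theta^0, \vartheta^0)} \exp\left[\sum_{i=1}^{r} (\theta_i - \theta_i^0) U_i(x) + \sum_{j=1}^{s} (\vartheta_j - \vartheta_j^0) T_j(x)\right] d\mu^*(x),$$

and the result follows from Lemma 2.5.1, with

$$d\lambda_\theta(t) = \exp\left(-\sum \vartheta_j^0 t_j\right) \left[\int \exp\left[\sum_{i=1}^{r} (\theta_i - \theta_i^0) u_i\right] dP^{U|t}_{\theta^0,\vartheta^0}(u)\right] dP^T_{\theta^0,\vartheta^0}(t),$$

$$d\nu_t(u) = \exp\left(-\sum \theta_i^0 u_i\right) dP^{U|t}_{\theta^0,\vartheta^0}(u).$$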


Theorem 2.7.1 Let $\phi$ be any function on $(\mathcal{X},\mathcal{A})$ for which the integral

$$\int \phi(x)\exp\Big[\sum_{j=1}^{k}\theta_j T_j(x)\Big]\,d\mu(x) \tag{2.38}$$

considered as a function of the complex variables $\theta_j = \xi_j + i\eta_j$ ($j = 1,\dots,k$)
exists for all $(\xi_1,\dots,\xi_k) \in \Omega$ and is finite. Then

(i) the integral is an analytic function of each of the $\theta$'s in the region $R$ of
parameter points for which $(\xi_1,\dots,\xi_k)$ is an interior point of the natural
parameter space $\Omega$;

(ii) the derivatives of all orders with respect to the $\theta$'s of the integral (2.38) can
be computed under the integral sign.

Proof. Let $(\xi_1^0,\dots,\xi_k^0)$ be any fixed point in the interior of $\Omega$, and consider one
of the variables in question, say $\theta_1$. Breaking up the factor

$$\phi(x)\exp\big[(\xi_2^0 + i\eta_2^0)T_2(x) + \cdots + (\xi_k^0 + i\eta_k^0)T_k(x)\big]$$

into its real and imaginary parts and each of these into its positive and negative
part, and absorbing this factor in each of the four terms thus obtained into the
measure $\mu$, one sees that as a function of $\theta_1$ the integral (2.38) can be written as

$$\int \exp[\theta_1T_1(x)]\,d\mu_1(x) - \int \exp[\theta_1T_1(x)]\,d\mu_2(x)
+ i\int \exp[\theta_1T_1(x)]\,d\mu_3(x) - i\int \exp[\theta_1T_1(x)]\,d\mu_4(x).$$

It is therefore sufficient to prove the result for integrals of the form

$$\psi(\theta_1) = \int \exp[\theta_1T_1(x)]\,d\mu(x).$$

Since $(\xi_1^0,\dots,\xi_k^0)$ is in the interior of $\Omega$, there exists $\delta > 0$ such that $\psi(\theta_1)$ exists
and is finite for all $\theta_1$ with $|\xi_1 - \xi_1^0| \le \delta$. Consider the difference quotient

$$\frac{\psi(\theta_1) - \psi(\theta_1^0)}{\theta_1 - \theta_1^0}
= \int \frac{\exp[\theta_1T_1(x)] - \exp[\theta_1^0T_1(x)]}{\theta_1 - \theta_1^0}\,d\mu(x).$$

The integrand can be written as

$$\exp[\theta_1^0T_1(x)]\left[\frac{\exp[(\theta_1 - \theta_1^0)T_1(x)] - 1}{\theta_1 - \theta_1^0}\right].$$

Applying to the second factor the inequality

$$\left|\frac{\exp(az) - 1}{z}\right| \le \frac{\exp(\delta|a|)}{\delta} \quad\text{for } |z| \le \delta,$$

the integrand is seen to be bounded above in absolute value by

$$\frac{1}{\delta}\left|\exp(\theta_1^0T_1 + \delta|T_1|)\right|
\le \frac{1}{\delta}\left|\exp[(\theta_1^0 + \delta)T_1] + \exp[(\theta_1^0 - \delta)T_1]\right|$$

for $|\theta_1 - \theta_1^0| \le \delta$. Since the right-hand side is integrable, it follows from the Lebesgue
dominated-convergence theorem [Theorem 2.2.2(ii)] that for any sequence of
points $\theta_1^{(n)}$ tending to $\theta_1^0$, the difference quotient of $\psi$ tends to

$$\int T_1(x)\exp[\theta_1^0T_1(x)]\,d\mu(x).$$

This completes the proof of (i), and proves (ii) for the first derivative. The proof
for the higher derivatives is by induction and is completely analogous. ■
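As a numerical illustration of part (ii), the following Python sketch compares the difference quotient of $\psi$ with the integral of $T_1\exp(\theta_1T_1)$. It is only a sketch under simplifying assumptions: the measure $\mu$ is taken to be a hypothetical finite measure on four points (so the integral is a finite sum), $\theta_1$ is real, and the names `SUPPORT`, `psi`, `dpsi_under_integral` and `difference_quotient` are ours, not the book's.

```python
import math

# Hypothetical finite measure mu: pairs (t, weight), where t is the value of T_1.
SUPPORT = [(-1.0, 0.5), (0.0, 1.0), (2.0, 0.25), (3.5, 0.1)]


def psi(theta):
    """psi(theta) = integral of exp(theta * T_1) d mu, here a finite sum."""
    return sum(w * math.exp(theta * t) for t, w in SUPPORT)


def dpsi_under_integral(theta):
    """Derivative taken under the integral sign: integral of T_1 exp(theta T_1) d mu."""
    return sum(w * t * math.exp(theta * t) for t, w in SUPPORT)


def difference_quotient(theta0, h):
    """[psi(theta0 + h) - psi(theta0)] / h, the quotient studied in the proof."""
    return (psi(theta0 + h) - psi(theta0)) / h


theta0 = 0.3
for h in (1e-2, 1e-4, 1e-6):
    # The quotient approaches the "under the integral" value as h shrinks.
    print(h, difference_quotient(theta0, h), dpsi_under_integral(theta0))
```

For a measure with infinite support the same comparison would require the domination argument of the proof; the finite sum above sidesteps that point and only displays the conclusion.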


2.8 Problems 

Section 2.1 

Problem 2.1 Monotone class. A class $\mathcal{F}$ of subsets of a space is a field if it
contains the whole space and is closed under complementation and under finite
unions; a class $\mathcal{M}$ is monotone if the union and intersection of every increasing
and decreasing sequence of sets of $\mathcal{M}$ is again in $\mathcal{M}$. The smallest monotone class
$\mathcal{M}_0$ containing a given field $\mathcal{F}$ coincides with the smallest $\sigma$-field $\mathcal{A}$ containing
$\mathcal{F}$. [One proves first that $\mathcal{M}_0$ is a field. To show, for example, that $A \cap B \in \mathcal{M}_0$
when $A$ and $B$ are in $\mathcal{M}_0$, consider, for a fixed set $A \in \mathcal{F}$, the class $\mathcal{M}_A$ of all
$B$ in $\mathcal{M}_0$ for which $A \cap B \in \mathcal{M}_0$. Then $\mathcal{M}_A$ is a monotone class containing $\mathcal{F}$,
and hence $\mathcal{M}_A = \mathcal{M}_0$. Thus $A \cap B \in \mathcal{M}_A$ for all $B$. The argument can now
be repeated with a fixed set $B \in \mathcal{M}_0$ and the class $\mathcal{M}_B$ of sets $A$ in $\mathcal{M}_0$ for
which $A \cap B \in \mathcal{M}_0$. Since $\mathcal{M}_0$ is a field and monotone, it is a $\sigma$-field containing
$\mathcal{F}$ and hence contains $\mathcal{A}$. But any $\sigma$-field is a monotone class so that also $\mathcal{M}_0$ is
contained in $\mathcal{A}$.]

Section 2.2 

Problem 2.2 Prove Corollary 2.2.1 using Theorems 2.2.1 and 2.2.2. 


Problem 2.3 Radon-Nikodym derivatives.

(i) If $\lambda$ and $\mu$ are $\sigma$-finite measures over $(\mathcal{X},\mathcal{A})$ and $\mu$ is absolutely continuous
with respect to $\lambda$, then

$$\int f\,d\mu = \int f\,\frac{d\mu}{d\lambda}\,d\lambda$$

for any $\mu$-integrable function $f$.

(ii) If $\lambda$, $\mu$, and $\nu$ are $\sigma$-finite measures over $(\mathcal{X},\mathcal{A})$ such that $\nu$ is absolutely
continuous with respect to $\mu$ and $\mu$ with respect to $\lambda$, then

$$\frac{d\nu}{d\lambda} = \frac{d\nu}{d\mu}\,\frac{d\mu}{d\lambda} \quad\text{a.e. } \lambda.$$

(iii) If $\mu$ and $\nu$ are $\sigma$-finite measures which are equivalent in the sense that each
is absolutely continuous with respect to the other, then

$$\frac{d\nu}{d\mu} = \left(\frac{d\mu}{d\nu}\right)^{-1} \quad\text{a.e. } \mu, \nu.$$

(iv) If $\mu_k$, $k = 1,2,\dots$, and $\mu$ are finite measures over $(\mathcal{X},\mathcal{A})$ such that
$\sum_{k=1}^{\infty}\mu_k(A) = \mu(A)$ for all $A \in \mathcal{A}$, and if the $\mu_k$ are absolutely continuous
with respect to a $\sigma$-finite measure $\lambda$, then $\mu$ is absolutely continuous with respect
to $\lambda$, and

$$\frac{d\sum_{k=1}^{n}\mu_k}{d\lambda} = \sum_{k=1}^{n}\frac{d\mu_k}{d\lambda}, \qquad
\lim_{n\to\infty}\frac{d\sum_{k=1}^{n}\mu_k}{d\lambda} = \frac{d\mu}{d\lambda} \quad\text{a.e. } \lambda.$$

[(i): The equation in question holds when $f$ is the indicator of a set, hence when
$f$ is simple, and therefore for all integrable $f$.

(ii): Apply (i) with $f = d\nu/d\mu$.]


Problem 2.4 If $f(x) > 0$ for all $x \in S$ and $\mu$ is $\sigma$-finite, then $\int_S f\,d\mu = 0$ implies
$\mu(S) = 0$.

[Let $S_n$ be the subset of $S$ on which $f(x) \ge 1/n$. Then $\mu(S) \le \sum \mu(S_n)$ and
$\mu(S_n) \le n\int_{S_n} f\,d\mu \le n\int_S f\,d\mu = 0$.]


Section 2.3 

Problem 2.5 Let $(\mathcal{X},\mathcal{A})$ be a measurable space, and $\mathcal{A}_0$ a $\sigma$-field contained in
$\mathcal{A}$. Suppose that for any function $T$, the $\sigma$-field $\mathcal{B}$ is taken as the totality of sets $B$
such that $T^{-1}(B) \in \mathcal{A}$. Then it is not necessarily true that there exists a function
$T$ such that $T^{-1}(\mathcal{B}) = \mathcal{A}_0$. [An example is furnished by any $\mathcal{A}_0$ such that for all
$x$ the set consisting of the single point $x$ is in $\mathcal{A}_0$.]


Section 2.4

Problem 2.6 (i) Let $\mathcal{P}$ be any family of distributions of $X = (X_1,\dots,X_n)$ such
that

$$P\{(X_i,X_{i+1},\dots,X_n,X_1,\dots,X_{i-1}) \in A\} = P\{(X_1,\dots,X_n) \in A\}$$

for all Borel sets $A$ and all $i = 1,\dots,n$. For any sample point $(x_1,\dots,x_n)$
define $(y_1,\dots,y_n) = (x_i,x_{i+1},\dots,x_n,x_1,\dots,x_{i-1})$, where $x_i = x_{(1)} =
\min(x_1,\dots,x_n)$. Then the conditional expectation of $f(X)$ given $Y = y$ is

$$f_0(y_1,\dots,y_n) = \frac{1}{n}\big[f(y_1,\dots,y_n) + f(y_2,\dots,y_n,y_1) + \cdots + f(y_n,y_1,\dots,y_{n-1})\big].$$

(ii) Let $G = \{g_1,\dots,g_r\}$ be any group of permutations of the coordinates
$x_1,\dots,x_n$ of a point $x$ in $n$-space, and denote by $gx$ the point obtained by
applying $g$ to the coordinates of $x$. Let $\mathcal{P}$ be any family of distributions $P$
of $X = (X_1,\dots,X_n)$ such that

$$P\{gX \in A\} = P\{X \in A\} \quad\text{for all } g \in G. \tag{2.39}$$

For any point $x$ let $t = T(x)$ be any rule that selects a unique point from
the $r$ points $g_kx$, $k = 1,\dots,r$ (for example the smallest first coordinate
if this defines it uniquely, otherwise also the smallest second coordinate,
etc.). Then

$$E[f(X) \mid t] = \frac{1}{r}\sum_{k=1}^{r} f(g_kt).$$

(iii) Suppose that in (ii) the distributions $P$ do not satisfy the invariance
condition (2.39) but are given by

$$dP(x) = h(x)\,d\mu(x),$$

where $\mu$ is invariant in the sense that $\mu\{x : gx \in A\} = \mu(A)$. Then

$$E[f(X) \mid t] = \frac{\sum_{k=1}^{r} f(g_kt)h(g_kt)}{\sum_{k=1}^{r} h(g_kt)}.$$

Section 2.5 

Problem 2.7 Prove Theorem 2.5.1 for the case of an $n$-dimensional sample
space. [The condition that the cumulative distribution function is nondecreasing
is replaced by $P\{x_1 < X_1 \le x_1',\dots,x_n < X_n \le x_n'\} \ge 0$; the condition that it is
continuous on the right can be stated as $\lim_{m\to\infty} F(x_1 + 1/m,\dots,x_n + 1/m) =
F(x_1,\dots,x_n)$.]


Problem 2.8 Let $\mathcal{X} = \mathcal{Y} \times \mathcal{T}$, and suppose that $P_0$, $P_1$ are two probability
distributions given by

$$dP_0(y,t) = f(y)g(t)\,d\mu(y)\,d\nu(t),$$
$$dP_1(y,t) = h(y,t)\,d\mu(y)\,d\nu(t),$$

where $h(y,t)/f(y)g(t) < \infty$. Then under $P_1$ the probability density of $Y$ with
respect to $\mu$ is

$$p_1^Y(y) = f(y)\,E_0\left[\frac{h(y,T)}{f(y)g(T)} \,\Big|\, Y = y\right].$$

[We have

$$p_1^Y(y) = \int_{\mathcal{T}} h(y,t)\,d\nu(t) = f(y)\int_{\mathcal{T}} \frac{h(y,t)}{f(y)g(t)}\,g(t)\,d\nu(t).]$$


Section 2.6 

Problem 2.9 Symmetric distributions. 

(i) Let $\mathcal{P}$ be any family of distributions of $X = (X_1,\dots,X_n)$ which are
symmetric in the sense that

$$P\{(X_{i_1},\dots,X_{i_n}) \in A\} = P\{(X_1,\dots,X_n) \in A\}$$

for all Borel sets $A$ and all permutations $(i_1,\dots,i_n)$ of $(1,\dots,n)$. Then the
statistic $T$ of Example 2.4.1 is sufficient for $\mathcal{P}$, and the formula given in the
first part of the example for the conditional expectation $E[f(X) \mid T(x)]$ is
valid.

(ii) The statistic $Y$ of Problem 2.6 is sufficient.

(iii) Let $X_1,\dots,X_n$ be identically and independently distributed according to
a continuous distribution $P \in \mathcal{P}$, and suppose that the distributions of $\mathcal{P}$
are symmetric with respect to the origin. Let $V_i = |X_i|$ and $W_i = V_{(i)}$.
Then $(W_1,\dots,W_n)$ is sufficient for $\mathcal{P}$.

Problem 2.10 Sufficiency of likelihood ratios. Let $P_0$, $P_1$ be two distributions
with densities $p_0$, $p_1$. Then $T(x) = p_1(x)/p_0(x)$ is sufficient for $\mathcal{P} = \{P_0,P_1\}$.
[This follows from the factorization criterion by writing $p_1 = T \cdot p_0$, $p_0 = 1 \cdot p_0$.]


Problem 2.11 Pairwise sufficiency. A statistic $T$ is pairwise sufficient for $\mathcal{P}$ if
it is sufficient for every pair of distributions in $\mathcal{P}$.

(i) If $\mathcal{P}$ is countable and $T$ is pairwise sufficient for $\mathcal{P}$, then $T$ is sufficient for
$\mathcal{P}$.

(ii) If $\mathcal{P}$ is a dominated family and $T$ is pairwise sufficient for $\mathcal{P}$, then $T$ is
sufficient for $\mathcal{P}$.

[(i): Let $\mathcal{P} = \{P_0,P_1,\dots\}$, and let $\mathcal{A}_0$ be the sufficient subfield induced by $T$.
Let $\lambda = \sum c_iP_i$ ($c_i > 0$) be equivalent to $\mathcal{P}$. For each $j = 1,2,\dots$ the probability
measure $\lambda_j$ that is proportional to $(c_0/n)P_0 + c_jP_j$ is equivalent to $\{P_0,P_j\}$.
Thus by pairwise sufficiency, the derivative $f_j = dP_0/[(c_0/n)\,dP_0 + c_j\,dP_j]$ is
$\mathcal{A}_0$-measurable. Let $S_j = \{x : f_j(x) = 0\}$ and $S = \bigcup_{j=1}^{n} S_j$. Then $S \in \mathcal{A}_0$,
$P_0(S) = 0$, and on $\mathcal{X} - S$ the derivative $dP_0/d\sum_{j=0}^{n} c_jP_j$ equals $\big(\sum_{j=1}^{n} 1/f_j\big)^{-1}$
which is $\mathcal{A}_0$-measurable. It then follows from Problem 2.3 that

$$\frac{dP_0}{d\lambda} = \frac{dP_0}{d\sum_{j=0}^{n} c_jP_j}\cdot\frac{d\sum_{j=0}^{n} c_jP_j}{d\lambda}$$

is also $\mathcal{A}_0$-measurable. (ii): Let $\lambda = \sum_{j=1}^{\infty} c_jP_{\theta_j}$ be equivalent to $\mathcal{P}$. Then pairwise
sufficiency of $T$ implies for any $\theta_0$ that $dP_{\theta_0}/(dP_{\theta_0} + d\lambda)$ and hence $dP_{\theta_0}/d\lambda$ is a
measurable function of $T$.]





Problem 2.12 If a statistic $T$ is sufficient for $\mathcal{P}$, then for every function $f$ which
is $(\mathcal{A},P_\theta)$-integrable for all $\theta \in \Omega$ there exists a determination of the conditional
expectation function $E_\theta[f(X) \mid t]$ that is independent of $\theta$. [If $\mathcal{X}$ is Euclidean, this
follows from Theorems 2.5.2 and 2.6.1. In general, if $f$ is nonnegative there exists
a nondecreasing sequence of simple nonnegative functions $f_n$ tending to $f$. Since
the conditional expectation of a simple function can be taken to be independent
of $\theta$ by Lemma 2.4.1(i), the desired result follows from Lemma 2.4.1(iv).]

Problem 2.13 For a decision problem with a finite number of decisions, the class
of procedures depending on a sufficient statistic $T$ only is essentially complete.
[For Euclidean sample spaces this follows from Theorem 2.5.1 without any restriction
on the decision space. For the present case, let a decision procedure be given
by $\delta(x) = (\delta^{(1)}(x),\dots,\delta^{(m)}(x))$ where $\delta^{(i)}(x)$ is the probability with which
decision $d_i$ is taken when $x$ is observed. If $T$ is sufficient and $\eta^{(i)}(t) = E[\delta^{(i)}(X) \mid t]$,
the procedures $\delta$ and $\eta$ have identical risk functions.] [More general versions of this
result are discussed, for example, by Elfving (1952), Bahadur (1955), Burkholder
(1961), LeCam (1964), and Roy and Ramamoorthi (1979).]

Section 2.7

Problem 2.14 Let $X_i$ ($i = 1,\dots,s$) be independently distributed with Poisson
distribution $P(\lambda_i)$, and let $T_0 = \sum X_j$, $T_i = X_i$, $\lambda = \sum \lambda_j$. Then $T_0$ has the
Poisson distribution $P(\lambda)$, and the conditional distribution of $T_1,\dots,T_{s-1}$ given
$T_0 = t_0$ is the multinomial distribution (2.34) with $n = t_0$ and $p_i = \lambda_i/\lambda$.

Problem 2.15 Life testing. Let $X_1,\dots,X_n$ be independently distributed with
exponential density $(2\theta)^{-1}e^{-x/2\theta}$ for $x \ge 0$, and let the ordered $X$'s be denoted
by $Y_1 \le Y_2 \le \cdots \le Y_n$. It is assumed that $Y_1$ becomes available first, then $Y_2$,
and so on, and that observation is continued until $Y_r$ has been observed. This
might arise, for example, in life testing where each $X$ measures the length of life
of, say, an electron tube, and $n$ tubes are being tested simultaneously. Another
application is to the disintegration of radioactive material, where $n$ is the number
of atoms, and observation is continued until $r$ $\alpha$-particles have been emitted.

(i) The joint distribution of $Y_1,\dots,Y_r$ is an exponential family with density

$$\frac{1}{(2\theta)^r}\,\frac{n!}{(n-r)!}\exp\left[-\frac{\sum_{i=1}^{r} y_i + (n-r)y_r}{2\theta}\right], \qquad 0 \le y_1 \le \cdots \le y_r.$$

(ii) The distribution of $\big[\sum_{i=1}^{r} Y_i + (n-r)Y_r\big]/\theta$ is $\chi^2$ with $2r$ degrees of freedom.

(iii) Let $Y_1,Y_2,\dots$ denote the time required until the first, second, ... event
occurs in a Poisson process with parameter $1/(2\theta')$ (see Problem 1.1). Then
$Z_1 = Y_1/\theta'$, $Z_2 = (Y_2 - Y_1)/\theta'$, $Z_3 = (Y_3 - Y_2)/\theta',\dots$ are independently
distributed as $\chi^2$ with 2 degrees of freedom, and the joint density of $Y_1,\dots,Y_r$
is an exponential family with density

$$\frac{1}{(2\theta')^r}\exp\left(-\frac{y_r}{2\theta'}\right), \qquad 0 \le y_1 \le \cdots \le y_r.$$

The distribution of $Y_r/\theta'$ is again $\chi^2$ with $2r$ degrees of freedom.

(iv) The same model arises in the application to life testing if the number $n$ of
tubes is held constant by replacing each burned-out tube with a new one,
and if $Y_1$ denotes the time at which the first tube burns out, $Y_2$ the time
at which the second tube burns out, and so on, measured from some fixed
time.

[(ii): The random variables $Z_i = (n - i + 1)(Y_i - Y_{i-1})/\theta$ ($i = 1,2,\dots,r$) are
independently distributed as $\chi^2$ with 2 degrees of freedom, and
$\big[\sum_{i=1}^{r} Y_i + (n-r)Y_r\big]/\theta = \sum_{i=1}^{r} Z_i$.]

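Problem 2.16 For any $\theta$ which is an interior point of the natural parameter
space, the expectations and covariances of the statistics $T_j$ in the exponential
family (2.35) are given by

$$E[T_j(X)] = -\frac{\partial \log C(\theta)}{\partial\theta_j} \qquad (j = 1,\dots,k),$$

$$E[T_i(X)T_j(X)] - [ET_i(X)\,ET_j(X)] = -\frac{\partial^2 \log C(\theta)}{\partial\theta_i\,\partial\theta_j} \qquad (i,j = 1,\dots,k).$$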

Problem 2.17 Let $\Omega$ be the natural parameter space of the exponential family
(2.35), and for any fixed $t_{r+1},\dots,t_k$ ($r < k$) let $\Omega'_{\theta_1,\dots,\theta_r}$ be the natural parameter
space of the family of conditional distributions given $T_{r+1} = t_{r+1},\dots,T_k = t_k$.

(i) Then $\Omega'_{\theta_1,\dots,\theta_r}$ contains the projection $\Omega_{\theta_1,\dots,\theta_r}$ of $\Omega$ onto $\theta_1,\dots,\theta_r$.

(ii) An example in which $\Omega_{\theta_1,\dots,\theta_r}$ is a proper subset of $\Omega'_{\theta_1,\dots,\theta_r}$ is the family
of densities

$$p_{\theta_1\theta_2}(x,y) = C(\theta_1,\theta_2)\exp(\theta_1x + \theta_2y - xy), \qquad x, y > 0.$$


2.9 Notes 

The theory of measure and integration in abstract spaces and its application
to probability theory, including in particular conditional probability and
expectation, is treated in a number of books, among them Dudley (1989), Williams
(1991) and Billingsley (1995). The material on sufficient statistics and
exponential families is complemented by the corresponding sections in TPE2. Much
fuller treatments of exponential families (as well as sufficiency) are provided by
Barndorff-Nielsen (1978) and Brown (1986).



3 

Uniformly Most Powerful Tests 


3.1 Stating The Problem 

We now begin the study of the statistical problem that forms the principal subject 
of this book, the problem of hypothesis testing. As the term suggests, one wishes 
to decide whether or not some hypothesis that has been formulated is correct. The 
choice here lies between only two decisions: accepting or rejecting the hypothesis. 
A decision procedure for such a problem is called a test of the hypothesis in 
question. 

The decision is to be based on the value of a certain random variable $X$, the
distribution $P_\theta$ of which is known to belong to a class $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$. We shall
assume that if $\theta$ were known, one would also know whether or not the hypothesis
is true. The distributions of $\mathcal{P}$ can then be classified into those for which the
hypothesis is true and those for which it is false. The resulting two mutually
exclusive classes are denoted by $H$ and $K$, and the corresponding subsets of $\Omega$ by
$\Omega_H$ and $\Omega_K$ respectively, so that $H \cup K = \mathcal{P}$ and $\Omega_H \cup \Omega_K = \Omega$. Mathematically,
the hypothesis is equivalent to the statement that $P_\theta$ is an element of $H$. It is
therefore convenient to identify the hypothesis with this statement and to use
the letter $H$ also to denote the hypothesis. Analogously we call the distributions
in $K$ the alternatives to $H$, so that $K$ is the class of alternatives.

Let the decisions of accepting or rejecting $H$ be denoted by $d_0$ and $d_1$
respectively. A nonrandomized test procedure assigns to each possible value $x$ of $X$ one
of these two decisions and thereby divides the sample space into two complementary
regions $S_0$ and $S_1$. If $X$ falls into $S_0$, the hypothesis is accepted; otherwise
it is rejected. The set $S_0$ is called the region of acceptance, and the set $S_1$ the
region of rejection or critical region.

When performing a test one may arrive at the correct decision, or one may
commit one of two errors: rejecting the hypothesis when it is true (error of the first 
kind) or accepting it when it is false (error of the second kind). The consequences 
of these are often quite different. For example, if one tests for the presence of 
some disease, incorrectly deciding on the necessity of treatment may cause the 
patient discomfort and financial loss. On the other hand, failure to diagnose the 
presence of the ailment may lead to the patient’s death. 

It is desirable to carry out the test in a manner which keeps the probabilities
of the two types of error to a minimum. Unfortunately, when the number of
observations is given, both probabilities cannot be controlled simultaneously. It
is customary therefore to assign a bound to the probability of incorrectly rejecting
$H$ when it is true and to attempt to minimize the other probability subject to
this condition. Thus one selects a number $\alpha$ between 0 and 1, called the level of
significance, and imposes the condition that

$$P_\theta\{\delta(X) = d_1\} = P_\theta\{X \in S_1\} \le \alpha \quad\text{for all } \theta \in \Omega_H. \tag{3.1}$$

Subject to this condition, it is desired to minimize $P_\theta\{\delta(X) = d_0\}$ for $\theta$ in $\Omega_K$
or, equivalently, to maximize

$$P_\theta\{\delta(X) = d_1\} = P_\theta\{X \in S_1\} \quad\text{for all } \theta \in \Omega_K. \tag{3.2}$$

Although usually (3.2) implies that

$$\sup_{\Omega_H} P_\theta\{X \in S_1\} = \alpha, \tag{3.3}$$

it is convenient to introduce a term for the left-hand side of (3.3): it is called
the size of the test or critical region $S_1$. The condition (3.1) therefore restricts
consideration to tests whose size does not exceed the given level of significance.
The probability of rejection (3.2) evaluated for a given $\theta$ in $\Omega_K$ is called the power
of the test against the alternative $\theta$. Considered as a function of $\theta$ for all $\theta \in \Omega$,
the probability (3.2) is called the power function of the test and is denoted by
$\beta(\theta)$.
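To make the notions of size, power and power function concrete, here is a small Python sketch for an example of our own choosing (it is not taken from the text): $X$ is assumed to be binomial with $n = 10$ trials and success probability $p$, the hypothesis is $p \le 1/2$, and the critical region is $S_1 = \{8, 9, 10\}$. The names `binom_pmf` and `power_function` are illustrative.

```python
from math import comb


def binom_pmf(x, n, p):
    """P{X = x} for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def power_function(critical_region, n, p):
    """beta(p) = P_p{X in S_1}, the probability of rejection at parameter p."""
    return sum(binom_pmf(x, n, p) for x in critical_region)


n = 10
S1 = range(8, 11)  # reject H when X >= 8 (assumed critical region)

# Size: supremum of the rejection probability over Omega_H = {p <= 1/2},
# approximated on a grid; for this upper-tail region it is attained at p = 1/2.
grid_H = [i / 100 for i in range(0, 51)]
size = max(power_function(S1, n, p) for p in grid_H)

print("size:", size)                       # 56/1024 = 0.0546875
print("power at p = 0.8:", power_function(S1, n, 0.8))
```

In this example the size slightly exceeds .05, so the region $\{8, 9, 10\}$ does not satisfy (3.1) at level $\alpha = .05$, while the smaller region $\{9, 10\}$ has size $11/1024$ and does.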

The choice of a level of significance $\alpha$ is usually somewhat arbitrary, since in
most situations there is no precise limit to the probability of an error of the first
kind that can be tolerated.¹ Standard values, such as .01 or .05, were originally
chosen to effect a reduction in the tables needed for carrying out various tests. By
habit, and because of the convenience of standardization in providing a common
frame of reference, these values gradually became entrenched as the conventional
levels to use. This is unfortunate, since the choice of significance level should also
take into consideration the power that the test will achieve against the
alternatives of interest. There is little point in carrying out an experiment which has
only a small chance of detecting the effect being sought when it exists. Surveys
by Cohen (1962) and Freiman et al. (1978) suggest that this is in fact the case
for many studies. Ideally, the sample size should then be increased to permit
adequate values for both significance level and power. If that is not feasible one may
wish to use higher values of $\alpha$ than the customary ones. The opposite possibility,
that one would like to decrease $\alpha$, arises when the power is so close to 1 that
$\alpha$ can be lowered appreciably without a significant loss of power (cf. Problem
3.11). Rules for choosing $\alpha$ in relation to the attainable power are discussed by
Lehmann (1958), Arrow (1960), and Sanathanan (1974), and from a Bayesian
point of view by Savage (1962, pp. 64-66). See also Rosenthal and Rubin (1985).

¹ The standard way to remove the arbitrary choice of $\alpha$ is to report the p-value of
the test, defined as the smallest level of significance leading to rejection of the null
hypothesis. This approach will be discussed toward the end of Section 3.3.

Another consideration that may enter into the specification of a significance 
level is the attitude toward the hypothesis before the experiment is performed. If 
one firmly believes the hypothesis to be true, extremely convincing evidence will 
be required before one is willing to give up this belief, and the significance level 
will accordingly be set very low. (A low significance level results in the hypothesis
being rejected only for a set of values of the observations whose total probability
under the hypothesis is small, so that such values would be most unlikely to occur if
$H$ were true.)

Let us next consider the structure of a randomized test. For any value $x$, such
a test chooses between the two decisions, rejection or acceptance, with certain
probabilities that depend on $x$ and will be denoted by $\phi(x)$ and $1 - \phi(x)$
respectively. If the value of $X$ is $x$, a random experiment is performed with two
possible outcomes $R$ and $\bar{R}$, the probabilities of which are $\phi(x)$ and $1 - \phi(x)$. If in
this experiment $R$ occurs, the hypothesis is rejected, otherwise it is accepted. A
randomized test is therefore completely characterized by a function $\phi$, the critical
function, with $0 \le \phi(x) \le 1$ for all $x$. If $\phi$ takes on only the values 1 and 0, one is
back in the case of a nonrandomized test. The set of points $x$ for which $\phi(x) = 1$
is then just the region of rejection, so that in a nonrandomized test $\phi$ is simply
the indicator function of the critical region.

If the distribution of $X$ is $P_\theta$, and the critical function $\phi$ is used, the probability
of rejection is

$$E_\theta\phi(X) = \int \phi(x)\,dP_\theta(x),$$

the conditional probability $\phi(x)$ of rejection given $x$, integrated with respect to
the probability distribution of $X$. The problem is to select $\phi$ so as to maximize
the power

$$\beta_\phi(\theta) = E_\theta\phi(X) \quad\text{for all } \theta \in \Omega_K \tag{3.4}$$

subject to the condition

$$E_\theta\phi(X) \le \alpha \quad\text{for all } \theta \in \Omega_H. \tag{3.5}$$

The same difficulty now arises that presented itself in the general discussion of
Chapter 1. Typically, the test that maximizes the power against a particular
alternative in $K$ depends on this alternative, so that some additional principle
has to be introduced to define what is meant by an optimum test. There is
one important exception: if $K$ contains only one distribution, that is, if one is
concerned with a single alternative, the problem is completely specified by (3.4)
and (3.5). It then reduces to the mathematical problem of maximizing an integral
subject to certain side conditions. The theory of this problem, and its statistical
applications, constitutes the principal subject of the present chapter. In special
cases it may of course turn out that the same test maximizes the power against all
alternatives in $K$ even when there is more than one. Examples of such uniformly
most powerful (UMP) tests will be given in Sections 3.4 and 3.7.

In the above formulation the problem can be considered as a special case of the
general decision problem with two types of losses. Corresponding to the two kinds
of error, one can introduce the two component loss functions,

$$\begin{aligned}
L_1(\theta,d_1) &= 1 \text{ or } 0 &&\text{as } \theta \in \Omega_H \text{ or } \theta \in \Omega_K,\\
L_1(\theta,d_0) &= 0 &&\text{for all } \theta
\end{aligned}$$

and

$$\begin{aligned}
L_2(\theta,d_0) &= 0 \text{ or } 1 &&\text{as } \theta \in \Omega_H \text{ or } \theta \in \Omega_K,\\
L_2(\theta,d_1) &= 0 &&\text{for all } \theta.
\end{aligned}$$

With this definition the minimization of $EL_2(\theta,\delta(X))$ subject to the restriction
$EL_1(\theta,\delta(X)) \le \alpha$ is exactly equivalent to the problem of hypothesis testing as
given above.

The formal loss functions $L_1$ and $L_2$ clearly do not represent in general the
true losses. The loss resulting from an incorrect acceptance of the hypothesis,
for example, will not be the same for all alternatives. The more the alternative 
differs from the hypothesis, the more serious are the consequences of such an 
error. As was discussed earlier, we have purposely foregone the more detailed 
approach implied by this criticism. Rather than working with a loss function 
which in practice one does not know, it seems preferable to base the theory on 
the simpler and intuitively appealing notion of error. It will be seen later that at 
least some of the results can be justified also in the more elaborate formulation. 


3.2 The Neyman-Pearson Fundamental Lemma 

A class of distributions is called simple if it contains a single distribution, and
otherwise it is said to be composite. The problem of hypothesis testing is
completely specified by (3.4) and (3.5) if $K$ is simple. Its solution is easiest and can
be given explicitly when the same is true of $H$. Let the distributions under a
simple hypothesis $H$ and alternative $K$ be $P_0$ and $P_1$, and suppose for a moment
that these distributions are discrete with $P_i\{X = x\} = P_i(x)$ for $i = 0,1$. If at
first one restricts attention to nonrandomized tests, the optimum test is defined
as the critical region $S$ satisfying

$$\sum_{x \in S} P_0(x) \le \alpha \tag{3.6}$$

and

$$\sum_{x \in S} P_1(x) = \text{maximum}.$$

It is easy to see which points should be included in $S$. To each point are attached
two values, its probability under $P_0$ and under $P_1$. The selected points are to have
a total value not exceeding $\alpha$ on the one scale, and as large as possible on the
other. This is a situation that occurs in many contexts. A buyer with a limited
budget who wants to get “the most for his money” will rate the items according to
their value per dollar. In order to travel a given distance in the shortest possible
time, one must choose the quickest mode of transportation, that is, the one that
yields the largest number of miles per hour. Analogously in the present problem
the most valuable points $x$ are those with the highest value of

$$r(x) = \frac{P_1(x)}{P_0(x)}.$$

The points are therefore rated according to the value of this ratio and selected
for $S$ in this order, as many as one can afford under restriction (3.6). Formally this
means that $S$ is the set of all points $x$ for which $r(x) > c$, where $c$ is determined
by the condition

$$P_0\{X \in S\} = \sum_{x:\,r(x) > c} P_0(x) = \alpha.$$

Here a difficulty is seen to arise. It may happen that when a certain point is
included, the value $\alpha$ has not yet been reached but that it would be exceeded if
the next point were also included. The exact value $\alpha$ can then either not be achieved
at all, or it can be attained only by breaking the preference order established by
$r(x)$. The resulting optimization problem has no explicit solution. (Algorithms
for obtaining the maximizing set $S$ are given by the theory of linear programming.)
The difficulty can be avoided, however, by a modification which does not
require violation of the $r$-order and which does lead to a simple explicit solution,
namely by permitting randomization.² This makes it possible to split the next
point, including only a portion of it, and thereby to obtain the exact value $\alpha$
without breaking the order of preference that has been established for inclusion
of the various sample points. These considerations are formalized in the following
theorem, the fundamental lemma of Neyman and Pearson.
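The difficulty just described, and the way randomization removes it, can be seen in a small numerical example. The Python sketch below uses a hypothetical four-point sample space with probabilities of our own choosing and assumes that $P_0(x) > 0$ everywhere and that the ratios $r(x)$ are distinct. It compares three procedures: selecting points in $r$-order without splitting, an exhaustive search over all nonrandomized regions satisfying (3.6), and selection in $r$-order with the last point split. The function names are illustrative.

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical discrete distributions on the points 0, 1, 2, 3.
P0 = [F(6, 100), F(5, 100), F(5, 100), F(84, 100)]
P1 = [F(30, 100), F(20, 100), F(19, 100), F(31, 100)]
ALPHA = F(10, 100)


def r_order(p0, p1):
    """Points sorted by decreasing ratio r(x) = P1(x) / P0(x)."""
    return sorted(range(len(p0)), key=lambda x: p1[x] / p0[x], reverse=True)


def greedy_nonrandomized(p0, p1, alpha):
    """Include whole points in r-order while the size stays <= alpha."""
    region, size = [], F(0)
    for x in r_order(p0, p1):
        if size + p0[x] > alpha:
            break
        region.append(x)
        size += p0[x]
    return region, size, sum((p1[x] for x in region), F(0))


def best_nonrandomized(p0, p1, alpha):
    """Exhaustive search over all regions S with P0(S) <= alpha."""
    best = ((), F(0), F(0))
    for m in range(len(p0) + 1):
        for s in combinations(range(len(p0)), m):
            size = sum((p0[x] for x in s), F(0))
            power = sum((p1[x] for x in s), F(0))
            if size <= alpha and power > best[2]:
                best = (s, size, power)
    return best


def randomized(p0, p1, alpha):
    """Include points in r-order, splitting the last one to reach size alpha."""
    phi, remaining = [F(0)] * len(p0), alpha
    for x in r_order(p0, p1):
        phi[x] = min(F(1), remaining / p0[x])
        remaining -= phi[x] * p0[x]
        if remaining == 0:
            break
    size = sum(phi[x] * p0[x] for x in range(len(p0)))
    power = sum(phi[x] * p1[x] for x in range(len(p0)))
    return phi, size, power


print(greedy_nonrandomized(P0, P1, ALPHA))  # region [0]: size 3/50, power 3/10
print(best_nonrandomized(P0, P1, ALPHA))    # region (1, 2): size 1/10, power 39/100
print(randomized(P0, P1, ALPHA))            # size 1/10, power 23/50
```

Here the best nonrandomized region $\{1, 2\}$ breaks the $r$-order (it omits the point with the largest ratio), while the randomized procedure keeps the order, attains size $\alpha$ exactly, and has larger power than either nonrandomized choice.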


Theorem 3.2.1 Let $P_0$ and $P_1$ be probability distributions possessing densities
$p_0$ and $p_1$ respectively with respect to a measure $\mu$.³

(i) Existence. For testing $H : p_0$ against the alternative $K : p_1$ there exists a
test $\phi$ and a constant $k$ such that

$$E_0\phi(X) = \alpha \tag{3.7}$$

and

$$\phi(x) = \begin{cases} 1 & \text{when } p_1(x) > kp_0(x), \\ 0 & \text{when } p_1(x) < kp_0(x). \end{cases} \tag{3.8}$$

(ii) Sufficient condition for a most powerful test. If a test satisfies (3.7) and
(3.8) for some $k$, then it is most powerful for testing $p_0$ against $p_1$ at level $\alpha$.

(iii) Necessary condition for a most powerful test. If $\phi$ is most powerful at
level $\alpha$ for testing $p_0$ against $p_1$, then for some $k$ it satisfies (3.8) a.e. $\mu$. It also
satisfies (3.7) unless there exists a test of size $< \alpha$ and with power 1.


Proof. For $\alpha = 0$ and $\alpha = 1$ the theorem is easily seen to be true provided the
value $k = +\infty$ is admitted in (3.8) and $0 \cdot \infty$ is interpreted as 0. Throughout the
proof we shall therefore assume $0 < \alpha < 1$.

² In practice, typically neither the breaking of the $r$-order nor randomization is
considered acceptable. The common solution, instead, is to adopt a value of $\alpha$ that can be
attained exactly and therefore does not present this problem.

³ There is no loss of generality in this assumption, since one can take $\mu = P_0 + P_1$.

(i): Let $\alpha(c) = P_0\{p_1(X) > cp_0(X)\}$. Since the probability is computed under
$P_0$, the inequality need be considered only for the set where $p_0(x) > 0$, so that
$\alpha(c)$ is the probability that the random variable $p_1(X)/p_0(X)$ exceeds $c$. Thus
$1 - \alpha(c)$ is a cumulative distribution function, and $\alpha(c)$ is nonincreasing and
continuous on the right, $\alpha(c - 0) - \alpha(c) = P_0\{p_1(X)/p_0(X) = c\}$, $\alpha(-\infty) = 1$,
and $\alpha(\infty) = 0$. Given any $0 < \alpha < 1$, let $c_0$ be such that $\alpha(c_0) \le \alpha \le \alpha(c_0 - 0)$,
and consider the test $\phi$ defined by

$$\phi(x) = \begin{cases}
1 & \text{when } p_1(x) > c_0p_0(x), \\[4pt]
\dfrac{\alpha - \alpha(c_0)}{\alpha(c_0 - 0) - \alpha(c_0)} & \text{when } p_1(x) = c_0p_0(x), \\[8pt]
0 & \text{when } p_1(x) < c_0p_0(x).
\end{cases}$$

Here the middle expression is meaningful unless $\alpha(c_0) = \alpha(c_0 - 0)$; since then
$P_0\{p_1(X) = c_0p_0(X)\} = 0$, $\phi$ is defined a.e. The size of $\phi$ is

$$E_0\phi(X) = P_0\left\{\frac{p_1(X)}{p_0(X)} > c_0\right\}
+ \frac{\alpha - \alpha(c_0)}{\alpha(c_0 - 0) - \alpha(c_0)}\,P_0\left\{\frac{p_1(X)}{p_0(X)} = c_0\right\} = \alpha,$$

so that $c_0$ can be taken as the $k$ of the theorem.
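The construction in part (i) can be carried out mechanically when the sample space is finite. The following Python sketch does so under assumptions of our own: $p_0(x) > 0$ for every point, and $X$ binomial with 5 trials, testing $p = 1/2$ against $p = 3/4$ at level $\alpha = 1/10$. The helper names `tail`, `atom` and `neyman_pearson` are illustrative; `tail(c)` plays the role of $\alpha(c)$ and `atom(c)` that of $\alpha(c - 0) - \alpha(c)$.

```python
from fractions import Fraction as F
from math import comb


def neyman_pearson(p0, p1, alpha):
    """Return (c0, gamma, phi) as in part (i) of the proof.

    Assumes a finite sample space with p0[x] > 0 for all x and 0 < alpha < 1.
    """
    points = range(len(p0))
    ratio = [p1[x] / p0[x] for x in points]

    def tail(c):
        # alpha(c) = P0{p1(X)/p0(X) > c}
        return sum((p0[x] for x in points if ratio[x] > c), F(0))

    def atom(c):
        # alpha(c - 0) - alpha(c) = P0{p1(X)/p0(X) = c}
        return sum((p0[x] for x in points if ratio[x] == c), F(0))

    # Smallest attained ratio c0 with alpha(c0) <= alpha; then alpha <= alpha(c0 - 0).
    c0 = next(c for c in sorted(set(ratio)) if tail(c) <= alpha)
    gamma = (alpha - tail(c0)) / atom(c0)
    phi = [F(1) if ratio[x] > c0 else gamma if ratio[x] == c0 else F(0)
           for x in points]
    return c0, gamma, phi


# Assumed example: X ~ Binomial(5, p), H: p = 1/2 against K: p = 3/4.
n = 5
p0 = [comb(n, x) * F(1, 2) ** n for x in range(n + 1)]
p1 = [comb(n, x) * F(3, 4) ** x * F(1, 4) ** (n - x) for x in range(n + 1)]
alpha = F(1, 10)

c0, gamma, phi = neyman_pearson(p0, p1, alpha)
size = sum(phi[x] * p0[x] for x in range(n + 1))
power = sum(phi[x] * p1[x] for x in range(n + 1))
print(c0, gamma, phi, size, power)
```

In this example the test rejects when $X = 5$, rejects with probability $11/25$ when $X = 4$, and accepts otherwise; exact rational arithmetic confirms that its size equals $\alpha$.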

(ii): Suppose that $\phi$ is a test satisfying (3.7) and (3.8) and that $\phi^*$ is any
other test with $E_0\phi^*(X) \le \alpha$. Denote by $S^+$ and $S^-$ the sets in the sample space
where $\phi(x) - \phi^*(x) > 0$ and $< 0$ respectively. If $x$ is in $S^+$, $\phi(x)$ must be $> 0$ and
$p_1(x) \ge kp_0(x)$. In the same way $p_1(x) \le kp_0(x)$ for all $x$ in $S^-$, and hence

$$\int (\phi - \phi^*)(p_1 - kp_0)\,d\mu = \int_{S^+ \cup S^-} (\phi - \phi^*)(p_1 - kp_0)\,d\mu \ge 0.$$

The difference in power between $\phi$ and $\phi^*$ therefore satisfies

$$\int (\phi - \phi^*)p_1\,d\mu \ge k\int (\phi - \phi^*)p_0\,d\mu \ge 0,$$

as was to be proved.

(iii): Let $\phi^*$ be most powerful at level $\alpha$ for testing $p_0$ against $p_1$, and let $\phi$
satisfy (3.7) and (3.8). Let $S$ be the intersection of the set $S^+ \cup S^-$, on which
$\phi$ and $\phi^*$ differ, with the set $\{x : p_1(x) \ne kp_0(x)\}$, and suppose that $\mu(S) > 0$.
Since $(\phi - \phi^*)(p_1 - kp_0)$ is positive on $S$, it follows from Problem 2.4 that

$$\int_{S^+ \cup S^-} (\phi - \phi^*)(p_1 - kp_0)\,d\mu = \int_S (\phi - \phi^*)(p_1 - kp_0)\,d\mu > 0$$

and hence that $\phi$ is more powerful against $p_1$ than $\phi^*$. This is a contradiction,
and therefore $\mu(S) = 0$, as was to be proved.

If $\phi^*$ were of size $< \alpha$ and power $< 1$, it would be possible to include in the
rejection region additional points or portions of points and thereby to increase
the power until either the power is 1 or the size is $\alpha$. Thus either $E_0\phi^*(X) =
\alpha$ or $E_1\phi^*(X) = 1$. ■

The proof of part (iii) shows that the most powerful test is uniquely determined
by (3.7) and (3.8) except on the set on which $p_1(x) = kp_0(x)$. On this set, $\phi$ can
be defined arbitrarily provided the resulting test has size $\alpha$. Actually, we have
shown that it is always possible to define $\phi$ to be constant over this boundary set. In the
trivial case that there exists a test of power 1, the constant $k$ of (3.8) is 0, and
one will accept $H$ for all points for which $p_1(x) = kp_0(x)$ even though the test
may then have size $< \alpha$.

It follows from these remarks that the most powerful test is determined
uniquely (up to sets of measure zero) by (3.7) and (3.8) whenever the set on
which $p_1(x) = kp_0(x)$ has $\mu$-measure zero. This unique test is then clearly
nonrandomized. More generally, it is seen that randomization is not required except
possibly on the boundary set, where it may be necessary to randomize in order
to get the size equal to $\alpha$. When there exists a test of power 1, (3.7) and (3.8)
will determine a most powerful test, but it may not be unique in that there may
exist a test also most powerful and satisfying (3.7) and (3.8) for some $\alpha' < \alpha$.

Corollary 3.2.1 Let $\beta$ denote the power of the most powerful level-$\alpha$ test ($0 <
\alpha < 1$) for testing $P_0$ against $P_1$. Then $\alpha < \beta$ unless $P_0 = P_1$.

Proof. Since the level-$\alpha$ test given by $\phi(x) \equiv \alpha$ has power $\alpha$, it is seen that
$\alpha \le \beta$. If $\alpha = \beta < 1$, the test $\phi(x) \equiv \alpha$ is most powerful and by Theorem
3.2.1(iii) must satisfy (3.8). Then $p_0(x) = p_1(x)$ a.e. $\mu$ and hence $P_0 = P_1$. ■

An alternative method for proving some of the results of this section is based
on the following geometric representation of the problem of testing a simple
hypothesis against a simple alternative. Let $N$ be the set of all points $(\alpha, \beta)$ for
which there exists a test $\phi$ such that

$$\alpha = E_0\phi(X), \qquad \beta = E_1\phi(X).$$

This set is convex, contains the points (0,0) and (1,1), and is symmetric with
respect to the point $(\tfrac12, \tfrac12)$ in the sense that with any point $(\alpha, \beta)$ it also contains
the point $(1 - \alpha, 1 - \beta)$. In addition, the set $N$ is closed. [This follows from the
weak compactness theorem for critical functions, Theorem A.5.1 of the Appendix;
the argument is the same as that in the proof of Theorem 3.6.1(i).]

For each value $0 < \alpha_0 < 1$, the level-$\alpha_0$ tests are represented by the points
whose abscissa is $\le \alpha_0$. The most powerful of these tests (whose existence follows
from the fact that $N$ is closed) corresponds to the point on the upper boundary
of $N$ with abscissa $\alpha_0$. This is the only point corresponding to a most powerful
level-$\alpha_0$ test unless there exists a point $(\alpha, 1)$ in $N$ with $\alpha < \alpha_0$ (Figure 3.1b).




Figure 3.1.

As an example of this geometric approach, consider the following alternative
proof of Corollary 3.2.1. Suppose that for some $0 < \alpha_0 < 1$ the power of the
most powerful level-$\alpha_0$ test is $\alpha_0$. Then it follows from the convexity of $N$ that
$(\alpha, \beta) \in N$ implies $\beta \le \alpha$, and hence from the symmetry of $N$ that $N$ consists
exactly of the line segment connecting the points (0,0) and (1,1). This means
that $\int \phi\, p_0\, d\mu = \int \phi\, p_1\, d\mu$ for all $\phi$ and hence that $p_0 = p_1$ (a.e. $\mu$), as was to be
proved. A proof of Theorem 3.2.1 along these lines is given in a more general
setting in the proof of Theorem 3.6.1.


Example 3.2.1 Suppose $X$ is an observation from $N(\xi, \sigma^2)$, with $\sigma^2$ known.
The null hypothesis specifies $\xi = 0$ and the alternative specifies $\xi = \xi_1$ for some
$\xi_1 > 0$. Then, the likelihood ratio is given by

$$\frac{p_1(x)}{p_0(x)} = \frac{\exp\left[-\frac{1}{2\sigma^2}(x - \xi_1)^2\right]}{\exp\left[-\frac{1}{2\sigma^2}x^2\right]} = \exp\left[\frac{\xi_1 x}{\sigma^2} - \frac{\xi_1^2}{2\sigma^2}\right]. \tag{3.9}$$

Since the exponential function is strictly increasing and $\xi_1 > 0$, the set of $x$ where
$p_1(x)/p_0(x) > k$ is equivalent to the set of $x$ where $x > k'$. In order to determine
$k'$, the level constraint

$$P_0\{X > k'\} = \alpha$$

must be satisfied, and so $k' = \sigma z_{1-\alpha}$, where $z_{1-\alpha}$ is the $1 - \alpha$ quantile of the
standard normal distribution. Therefore, the most powerful level $\alpha$ test rejects if
$X > \sigma z_{1-\alpha}$. ■
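The critical value and power in Example 3.2.1 can be checked numerically. The following is a minimal sketch, not part of the original text; the function names `mp_normal_test` and `power` and the numerical values of $\sigma$, $\alpha$ and $\xi_1$ are illustrative assumptions.

```python
from statistics import NormalDist

def mp_normal_test(sigma, alpha):
    """Critical value k' = sigma * z_{1-alpha}; the test rejects when X > k'."""
    return sigma * NormalDist().inv_cdf(1 - alpha)

def power(xi, sigma, alpha):
    """Rejection probability P_xi{X > k'} when X ~ N(xi, sigma^2)."""
    k_prime = mp_normal_test(sigma, alpha)
    return 1 - NormalDist(mu=xi, sigma=sigma).cdf(k_prime)

sigma, alpha = 2.0, 0.05               # illustrative values
k_prime = mp_normal_test(sigma, alpha)
size = power(0.0, sigma, alpha)        # rejection probability under xi = 0
power_at_3 = power(3.0, sigma, alpha)  # power against the alternative xi_1 = 3
```

Note that the critical value does not depend on $\xi_1$; only the power does.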


3.3 p-values

Testing at a fixed level $\alpha$ as described in Sections 3.1 and 3.2 is one of two standard
(non-Bayesian) approaches to the evaluation of hypotheses. To explain the other,
suppose that, under $P_0$, the distribution of $p_1(X)/p_0(X)$ is continuous. Then,
the most powerful level $\alpha$ test is nonrandomized and rejects if $p_1(X)/p_0(X) > k$,
where $k = k(\alpha)$ is determined by (3.7). For varying $\alpha$, the resulting tests provide
an example of the typical situation in which the rejection regions $S_\alpha$ are nested
in the sense that

$$S_\alpha \subset S_{\alpha'} \quad \text{if } \alpha < \alpha'. \tag{3.10}$$

When this is the case,⁴ it is good practice to determine not only whether the
hypothesis is accepted or rejected at the given significance level, but also to
determine the smallest significance level, or more formally

$$\hat p = \hat p(X) = \inf\{\alpha : X \in S_\alpha\}, \tag{3.11}$$

at which the hypothesis would be rejected for the given observation. This number,
the so-called p-value, gives an idea of how strongly the data contradict the
hypothesis.⁵ It also enables others to reach a verdict based on the significance
level of their choice.

4 See Problems 3.17 and 3.58 for examples where optimal nonrandomized tests need
not be nested.


Example 3.3.1 (Continuation of Example 3.2.1) Let $\Phi$ denote the standard
normal c.d.f. Then, the rejection region can be written as

$$S_\alpha = \{X : X > \sigma z_{1-\alpha}\} = \left\{X : \Phi\left(\frac{X}{\sigma}\right) > 1 - \alpha\right\} = \left\{X : 1 - \Phi\left(\frac{X}{\sigma}\right) < \alpha\right\}.$$

For a given observed value of $X$, the inf over all $\alpha$ where the last inequality holds
is

$$\hat p = 1 - \Phi\left(\frac{X}{\sigma}\right).$$

Alternatively, the p-value is $P_0\{X \ge x\}$, where $x$ is the observed value of $X$. Note
that, under $\xi = 0$, the distribution of $\hat p$ is given by

$$P_0\{\hat p \le u\} = P_0\left\{1 - \Phi\left(\frac{X}{\sigma}\right) \le u\right\} = P_0\left\{\Phi\left(\frac{X}{\sigma}\right) \ge 1 - u\right\} = u,$$

because $\Phi(X/\sigma)$ is uniformly distributed on (0,1) (see Problem 3.22); therefore,
$\hat p$ is uniformly distributed on (0,1). ■
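As a numerical illustration of Example 3.3.1, the sketch below computes $\hat p = 1 - \Phi(x/\sigma)$ and evaluates $P_0\{\hat p \le u\}$ through the equivalent event $\{X \ge \sigma z_{1-u}\}$. The helper names and the chosen values of $\sigma$, $x$ and $u$ are assumptions made for the illustration only.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def p_value(x, sigma):
    """p-hat = 1 - Phi(x / sigma) for the observed value x."""
    return 1 - STD.cdf(x / sigma)

def null_cdf_of_p(u, sigma):
    """P_0{p-hat <= u}, using {p-hat <= u} = {X >= sigma * z_{1-u}} under xi = 0."""
    cutoff = sigma * STD.inv_cdf(1 - u)
    return 1 - NormalDist(0.0, sigma).cdf(cutoff)

sigma = 2.0                       # illustrative value
p_obs = p_value(3.0, sigma)       # p-value for the observation x = 3
checks = [(u, null_cdf_of_p(u, sigma)) for u in (0.01, 0.05, 0.5, 0.9)]
```

Each pair in `checks` should agree up to rounding error, which is the uniformity statement of the example evaluated at a few points.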

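A general property of p-values is given in the following lemma, which applies
to both simple and composite null hypotheses.

Lemma 3.3.1 Suppose $X$ has distribution $P_\theta$ for some $\theta \in \Omega$, and the null
hypothesis $H$ specifies $\theta \in \Omega_H$. Assume the rejection regions satisfy (3.10).

(i) If

$$\sup_{\theta \in \Omega_H} P_\theta\{X \in S_\alpha\} \le \alpha \quad \text{for all } 0 < \alpha < 1, \tag{3.12}$$

then the distribution of $\hat p$ under $\theta \in \Omega_H$ satisfies

$$P_\theta\{\hat p \le u\} \le u \quad \text{for all } 0 \le u \le 1. \tag{3.13}$$

(ii) If, for $\theta \in \Omega_H$,

$$P_\theta\{X \in S_\alpha\} = \alpha \quad \text{for all } 0 < \alpha < 1, \tag{3.14}$$

then

$$P_\theta\{\hat p \le u\} = u \quad \text{for all } 0 \le u \le 1;$$

i.e. $\hat p$ is uniformly distributed over (0,1).

Proof. (i) If $\theta \in \Omega_H$, then the event $\{\hat p \le u\}$ implies $\{X \in S_v\}$ for all $u < v$,
so that $P_\theta\{\hat p \le u\} \le P_\theta\{X \in S_v\} \le v$ by (3.12). The result follows by letting $v \to u$.

(ii) Since the event $\{X \in S_u\}$ implies $\{\hat p \le u\}$, it follows that

$$P_\theta\{\hat p \le u\} \ge P_\theta\{X \in S_u\}.$$

Therefore, if (3.14) holds, then $P_\theta\{\hat p \le u\} \ge u$, and the result follows from (i). ■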


5 One could generalize the definition of p-value to include randomized level $\alpha$ tests $\phi_\alpha$
assuming that they are nested in the sense that $\phi_\alpha(x) \le \phi_{\alpha'}(x)$ for all $x$ and $\alpha < \alpha'$.
Simply define $\hat p = \inf\{\alpha : \phi_\alpha(X) = 1\}$; in words, $\hat p$ is the smallest level of significance
where the hypothesis is rejected with probability one.

Example 3.3.2 Suppose $X$ takes values 1, 2, ..., 10. Under $H$, the distribution
is uniform, i.e., $p_0(j) = \frac{1}{10}$ for $j = 1, \dots, 10$. Under $K$, suppose $p_1(j) = j/55$.
The MP level $\alpha = i/10$ test rejects if $X \ge 11 - i$. However, unless $\alpha$ is a multiple
of 1/10, the MP level $\alpha$ test is randomized. If we want to restrict attention to
nonrandomized procedures, consider the conservative approach by defining

$$S_\alpha = \{X \ge 11 - i\} \quad \text{if } \frac{i}{10} \le \alpha < \frac{i+1}{10}.$$

If the observed value of $X$ is $x$, then the p-value is given by $(11 - x)/10$. Then,
the distribution of $\hat p$ under $H$ is given by

$$P\{\hat p \le u\} = P\left\{\frac{11 - X}{10} \le u\right\} = P\{X \ge 11 - 10u\} \le u, \tag{3.15}$$

and the last inequality is an equality if and only if $u$ is of the form $i/10$ for some
integer $i = 0, 1, \dots, 10$, i.e. the levels for which the MP test is nonrandomized
(Problem 3.21). ■
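The inequality (3.15) can be verified by direct enumeration. The following sketch uses exact rational arithmetic; the grid of values $u = k/40$ and the helper names are illustrative choices, not part of the original example.

```python
from fractions import Fraction

SUPPORT = range(1, 11)
P0 = Fraction(1, 10)  # null probability of each point under H

def p_value(x):
    """Conservative p-value (11 - x)/10 induced by the nested regions S_alpha."""
    return Fraction(11 - x, 10)

def null_cdf_of_p(u):
    """P{p-hat <= u} under H, obtained by enumerating the ten sample points."""
    return sum(P0 for x in SUPPORT if p_value(x) <= u)

grid = [Fraction(k, 40) for k in range(41)]           # u = 0, 1/40, ..., 1
conservative = all(null_cdf_of_p(u) <= u for u in grid)
exact_levels = [u for u in grid if null_cdf_of_p(u) == u]
```

On this grid, equality holds only at the multiples of 1/10, in agreement with the statement following (3.15).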

P-values, with the additional information they provide, are typically more
appropriate than fixed levels in scientific problems, whereas a fixed predetermined
$\alpha$ is unavoidable when acceptance or rejection of $H$ implies an imminent concrete
decision. A review of some of the issues arising in this context, with references to
the literature, is given in Kruskal (1978).


3.4 Distributions with Monotone Likelihood Ratio 

The case that both the hypothesis and the class of alternatives are simple is
mainly of theoretical interest, since problems arising in applications typically
involve a parametric family of distributions depending on one or more parameters.
In the simplest situation of this kind the distributions depend on a single
real-valued parameter $\theta$, and the hypothesis is one-sided, say $H : \theta \le \theta_0$. In general,
the most powerful test of $H$ against an alternative $\theta_1 > \theta_0$ depends on $\theta_1$ and is
then not UMP. However, a UMP test does exist if an additional assumption is
satisfied. The real-parameter family of densities $p_\theta(x)$ is said to have monotone
likelihood ratio⁶ if there exists a real-valued function $T(x)$ such that for any
$\theta < \theta'$ the distributions $P_\theta$ and $P_{\theta'}$ are distinct, and the ratio $p_{\theta'}(x)/p_\theta(x)$ is a
nondecreasing function of $T(x)$.


Theorem 3.4.1 Let $\theta$ be a real parameter, and let the random variable $X$ have
probability density $p_\theta(x)$ with monotone likelihood ratio in $T(x)$.

(i) For testing $H : \theta \le \theta_0$ against $K : \theta > \theta_0$, there exists a UMP test, which
is given by

$$\phi(x) = \begin{cases} 1 & \text{when } T(x) > C, \\ \gamma & \text{when } T(x) = C, \\ 0 & \text{when } T(x) < C, \end{cases} \tag{3.16}$$

where $C$ and $\gamma$ are determined by

$$E_{\theta_0}\phi(X) = \alpha. \tag{3.17}$$

(ii) The power function

$$\beta(\theta) = E_\theta\phi(X)$$

of this test is strictly increasing for all points $\theta$ for which $0 < \beta(\theta) < 1$.

(iii) For all $\theta'$, the test determined by (3.16) and (3.17) is UMP for testing
$H' : \theta \le \theta'$ against $K' : \theta > \theta'$ at level $\alpha' = \beta(\theta')$.

(iv) For any $\theta < \theta_0$ the test minimizes $\beta(\theta)$ (the probability of an error of the
first kind) among all tests satisfying (3.17).

6 This definition is in terms of specific versions of the densities $p_\theta$. If instead the
definition is to be given in terms of the distribution $P_\theta$, various null-set considerations
enter which are discussed in Pfanzagl (1967).

Proof. (i) and (ii): Consider first the hypothesis $H_0 : \theta = \theta_0$ and some simple
alternative $\theta_1 > \theta_0$. The most desirable points for rejection are those for which
$r(x) = p_{\theta_1}(x)/p_{\theta_0}(x) = g[T(x)]$ is sufficiently large. If $T(x) < T(x')$, then $r(x) \le r(x')$
and $x'$ is at least as desirable as $x$. Thus the test which rejects for large
values of $T(x)$ is most powerful. As in the proof of Theorem 3.2.1(i), it is seen that
there exist $C$ and $\gamma$ such that (3.16) and (3.17) hold. By Theorem 3.2.1(ii), the
resulting test is also most powerful for testing $P_{\theta'}$ against $P_{\theta''}$ at level $\alpha' = \beta(\theta')$
provided $\theta' < \theta''$. Part (ii) of the present theorem now follows from Corollary
3.2.1. Since $\beta(\theta)$ is therefore nondecreasing, the test satisfies

$$E_\theta\phi(X) \le \alpha \quad \text{for } \theta \le \theta_0. \tag{3.18}$$

The class of tests satisfying (3.18) is contained in the class satisfying $E_{\theta_0}\phi(X) \le \alpha$.
Since the given test maximizes $\beta(\theta_1)$ within this wider class, it also maximizes
$\beta(\theta_1)$ subject to (3.18); since it is independent of the particular alternative $\theta_1 > \theta_0$
chosen, it is UMP against $K$.

(iii) is proved by an analogous argument.

(iv) follows from the fact that the test which minimizes the power for testing
a simple hypothesis against a simple alternative is obtained by applying the
fundamental lemma (Theorem 3.2.1) with all inequalities reversed.

By interchanging inequalities throughout, one obtains in an obvious manner
the solution of the dual problem, $H : \theta \ge \theta_0$, $K : \theta < \theta_0$. ■

The proof of (i) and (ii) exhibits the basic property of families with monotone
likelihood ratio: every pair of parameter values $\theta_0 < \theta_1$ establishes essentially
the same preference order of the sample points (in the sense of the preceding
section). A few examples of such families, and hence of UMP one-sided tests,
will be given below. However, the main applications of Theorem 3.4.1 will come
later, when such families appear as the set of conditional distributions given a
sufficient statistic (Chapters 4 and 5) and as distributions of a maximal invariant
(Chapters 6 and 7).


Example 3.4.1 (Hypergeometric) From a lot containing $N$ items of a
manufactured product, a sample of size $n$ is selected at random, and each item in the
sample is inspected. If the total number of defective items in the lot is $D$, the
number $X$ of defectives found in the sample has the hypergeometric distribution

$$P\{X = x\} = P_D(x) = \frac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}}, \qquad \max(0, n + D - N) \le x \le \min(n, D).$$

Interpreting $P_D(x)$ as a density with respect to the measure $\mu$ that assigns to any
set on the real line as measure the number of integers 0, 1, 2, ... that it contains,
and noting that for values of $x$ within its range

$$\frac{P_{D+1}(x)}{P_D(x)} = \begin{cases} \dfrac{D+1}{N-D} \cdot \dfrac{N-D-n+x}{D+1-x} & \text{if } n + D + 1 - N \le x \le D, \\[1ex] 0 \text{ or } \infty & \text{if } x = n + D - N \text{ or } D + 1, \end{cases}$$

it is seen that the distributions satisfy the assumption of monotone likelihood
ratios with $T(x) = x$. Therefore there exists a UMP test for testing the hypothesis
$H : D \le D_0$ against $K : D > D_0$, which rejects $H$ when $X$ is too large, and an
analogous test for testing $H' : D \ge D_0$. ■
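The ratio formula of Example 3.4.1 and its monotonicity in $x$ can be checked for a particular lot. The sketch below uses exact arithmetic; the values $N = 20$, $n = 6$, $D = 8$ and the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, n, D):
    """P_D(x): probability of x defectives in a sample of n from a lot of N with D defectives."""
    if x < max(0, n + D - N) or x > min(n, D):
        return Fraction(0)
    return Fraction(comb(D, x) * comb(N - D, n - x), comb(N, n))

def ratio_formula(x, N, n, D):
    """Closed form of P_{D+1}(x) / P_D(x), valid for n + D + 1 - N <= x <= D."""
    return Fraction(D + 1, N - D) * Fraction(N - D - n + x, D + 1 - x)

N, n, D = 20, 6, 8                                    # illustrative lot and sample sizes
xs = range(max(0, n + D + 1 - N), min(n, D) + 1)      # points where both densities are positive
ratios = [hypergeom_pmf(x, N, n, D + 1) / hypergeom_pmf(x, N, n, D) for x in xs]
```

For these values the ratios agree with the closed form and increase with $x$, which is the monotone likelihood ratio property with $T(x) = x$.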

An important class of families of distributions that satisfy the assumptions of
Theorem 3.4.1 are the one-parameter exponential families.

Corollary 3.4.1 Let $\theta$ be a real parameter, and let $X$ have probability density
(with respect to some measure $\mu$)

$$p_\theta(x) = C(\theta)e^{Q(\theta)T(x)}h(x), \tag{3.19}$$

where $Q$ is strictly monotone. Then there exists a UMP test $\phi$ for testing $H : \theta \le \theta_0$
against $K : \theta > \theta_0$. If $Q$ is increasing,

$$\phi(x) = 1, \gamma, 0 \quad \text{as} \quad T(x) >, =, < C,$$

where $C$ and $\gamma$ are determined by $E_{\theta_0}\phi(X) = \alpha$. If $Q$ is decreasing, the inequalities
are reversed.

A converse of Corollary 3.4.1 is given by Pfanzagl (1968), who shows under
weak regularity conditions that the existence of UMP tests against one-sided
alternatives for all sample sizes and one value of $\alpha$ implies an exponential family.

As in Example 3.4.1, we shall denote the right-hand side of (3.19) by $P_\theta(x)$
instead of $p_\theta(x)$ when it is a probability, that is, when $X$ is discrete and $\mu$ is
counting measure.

Example 3.4.2 (Binomial) The binomial distributions $b(p, n)$ with

$$P_p(x) = \binom{n}{x}p^x(1 - p)^{n-x}$$

satisfy (3.19) with $T(x) = x$, $\theta = p$, $Q(p) = \log[p/(1 - p)]$. The problem of testing
$H : p \ge p_0$ arises, for instance, in the situation of Example 3.4.1 if one supposes
that the production process is in statistical control, so that the various items
constitute independent trials with constant probability $p$ of being defective. The
number of defectives $X$ in a sample of size $n$ is then a sufficient statistic for the
distribution of the variables $X_i$ $(i = 1, \dots, n)$, where $X_i$ is 1 or 0 as the $i$th item
drawn is defective or not, and $X$ is distributed as $b(p, n)$. There exists therefore
a UMP test of $H$, which rejects $H$ when $X$ is too small.
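To make the test of Example 3.4.2 concrete, the sketch below determines the constants $C$ and $\gamma$ of the lower-tail version of (3.16) (reject when $X < C$, reject with probability $\gamma$ when $X = C$) so that the size at $p_0$ equals $\alpha$. The function names and the values $n = 10$, $p_0 = 0.5$, $\alpha = 0.05$ are assumptions for illustration.

```python
from math import comb

def binom_pmf(x, n, p):
    """P_p(x) for the binomial distribution b(p, n)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def ump_lower_tail(n, p0, alpha):
    """Return (C, gamma): reject H when X < C, reject with probability gamma when X == C."""
    below = 0.0                              # running value of P_{p0}{X < c}
    for c in range(n + 1):
        at_c = binom_pmf(c, n, p0)
        if below + at_c > alpha:
            return c, (alpha - below) / at_c
        below += at_c
    raise ValueError("alpha must be less than 1")

def power(p, n, C, gamma):
    """E_p phi(X) = P_p{X < C} + gamma * P_p{X = C}."""
    return sum(binom_pmf(x, n, p) for x in range(C)) + gamma * binom_pmf(C, n, p)

n, p0, alpha = 10, 0.5, 0.05                 # illustrative values
C, gamma = ump_lower_tail(n, p0, alpha)
```

Evaluating `power(p, n, C, gamma)` for several `p` shows the rejection probability decreasing in $p$, as expected for a test of $H : p \ge p_0$.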

An alternative sampling plan which is sometimes used in binomial situations
is inverse binomial sampling. Here the experiment is continued until a specified
number $m$ of successes (for example, cures effected by some new medical
treatment) have been obtained. If $Y_i$ denotes the number of trials after the
$(i - 1)$st success up to but not including the $i$th success, the probability that
$Y_i = y$ is $pq^y$ for $y = 0, 1, \dots$, so that the joint distribution of $Y_1, \dots, Y_m$ is

$$P_p(y_1, \dots, y_m) = p^m q^{\sum y_i}, \qquad y_k = 0, 1, \dots; \quad k = 1, \dots, m.$$

This is an exponential family with $T(y) = \sum y_i$ and $Q(p) = \log(1 - p)$. Since
$Q(p)$ is a decreasing function of $p$, the UMP test of $H : p \le p_0$ rejects $H$ when $T$
is too small. This is what one would expect, since the realization of $m$ successes
in only a few more than $m$ trials indicates a high value of $p$. The test statistic $T$,
which is the number of trials required in excess of $m$ to get $m$ successes, has the
negative binomial distribution [Problem 1.1(i)]

$$P(t) = \binom{m + t - 1}{m - 1}p^m q^t, \qquad t = 0, 1, \dots. \quad ■$$


Example 3.4.3 (Poisson) If $X_1, \dots, X_n$ are independent Poisson variables
with $E(X_i) = \lambda$, their joint distribution is

$$P_\lambda(x_1, \dots, x_n) = \frac{\lambda^{x_1 + \cdots + x_n}}{x_1! \cdots x_n!}e^{-n\lambda}.$$

This constitutes an exponential family with $T(x) = \sum x_i$, and $Q(\lambda) = \log\lambda$.
One-sided hypotheses concerning $\lambda$ might arise if $\lambda$ is a bacterial density and
the $X$'s are a number of bacterial counts, or if the $X$'s denote the number of
$\alpha$-particles produced in equal time intervals by a radioactive substance, etc. The
UMP test of the hypothesis $\lambda \le \lambda_0$ rejects when $\sum X_i$ is too large. Here the test
statistic $\sum X_i$ has itself a Poisson distribution with parameter $n\lambda$.

Instead of observing the radioactive material for given time periods or counting
the number of bacteria in given areas of a slide, one can adopt an inverse sampling
method. The experiment is then continued, or the area over which the bacteria
are counted is enlarged, until a count of $m$ has been obtained. The observations
consist of the times $T_1, \dots, T_m$ that it takes for the first occurrence, from the
first to the second, and so on. If one is dealing with a Poisson process and the
number of occurrences in a time or space interval $\tau$ has the distribution

$$P(x) = \frac{(\lambda\tau)^x}{x!}e^{-\lambda\tau}, \qquad x = 0, 1, \dots,$$

then the observed times are independently distributed, each with the exponential
density $\lambda e^{-\lambda t}$ for $t \ge 0$ [Problem 1.1(ii)]. The joint densities

$$p_\lambda(t_1, \dots, t_m) = \lambda^m \exp\left(-\lambda\sum_{i=1}^m t_i\right), \qquad t_1, \dots, t_m \ge 0,$$

form an exponential family with $T(t_1, \dots, t_m) = \sum t_i$ and $Q(\lambda) = -\lambda$. The UMP
test of $H : \lambda \le \lambda_0$ rejects when $T = \sum T_i$ is too small. Since $2\lambda T_i$ has density
$\frac12 e^{-u/2}$ for $u \ge 0$, which is the density of a $\chi^2$-distribution with 2 degrees of
freedom, $2\lambda T$ has a $\chi^2$-distribution with $2m$ degrees of freedom. The boundary
of the rejection region can therefore be determined from a table of $\chi^2$. ■
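The rejection boundary for the inverse-sampling test can be illustrated as follows. For an even number $2m$ of degrees of freedom the $\chi^2$ c.d.f. has the closed (Erlang) form $1 - e^{-c/2}\sum_{k<m}(c/2)^k/k!$, which the sketch uses to confirm a tabulated quantile and to evaluate the power function. The values $m = 5$, $\lambda_0 = 2$ and the tabulated 0.05-quantile 3.940 of $\chi^2_{10}$ are illustrative inputs, and the function names are hypothetical.

```python
from math import exp

def chi2_cdf_even(c, m):
    """P{chi^2_{2m} <= c} via the Erlang form 1 - exp(-c/2) * sum_{k<m} (c/2)^k / k!."""
    half, term, total = c / 2.0, 1.0, 0.0
    for k in range(m):
        total += term               # term equals (c/2)^k / k!
        term *= half / (k + 1)
    return 1.0 - exp(-half) * total

m, lam0 = 5, 2.0                    # illustrative count and null value
c_table = 3.940                     # tabulated 0.05-quantile of chi^2 with 10 d.f.
t_crit = c_table / (2.0 * lam0)     # reject H: lambda <= lam0 when T < t_crit

def power(lam):
    """P_lam{T < t_crit} = P{chi^2_{2m} < 2 * lam * t_crit}."""
    return chi2_cdf_even(2.0 * lam * t_crit, m)
```

The value `power(lam0)` should be close to 0.05, and `power` increases with `lam`, since a larger rate makes a short total waiting time more likely.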

The formulation of the problem of hypothesis testing given at the beginning
of the chapter takes account of the losses resulting from wrong decisions only
in terms of the two types of error. To obtain a more detailed description of the
problem of testing $H : \theta \le \theta_0$ against the alternatives $\theta > \theta_0$, one can consider
it as a decision problem with the decisions $d_0$ and $d_1$ of accepting and rejecting
$H$ and a loss function $L(\theta, d_i) = L_i(\theta)$. Typically, $L_0(\theta)$ will be 0 for $\theta \le \theta_0$ and
strictly increasing for $\theta \ge \theta_0$, and $L_1(\theta)$ will be strictly decreasing for $\theta \le \theta_0$ and
equal to 0 for $\theta \ge \theta_0$. The difference then satisfies

$$L_1(\theta) - L_0(\theta) \gtrless 0 \quad \text{as} \quad \theta \lessgtr \theta_0. \tag{3.20}$$

The following theorem is a special case of complete class results of Karlin and
Rubin (1956) and Brown, Cohen, and Strawderman (1976).

Theorem 3.4.2 (i) Under the assumptions of Theorem 3.4.1, the family of
tests given by (3.16) and (3.17) with $0 \le \alpha \le 1$ is essentially complete provided
the loss function satisfies (3.20).

(ii) This family is also minimal essentially complete if the set of points $x$ for
which $p_\theta(x) > 0$ is independent of $\theta$.

Proof. (i): The risk function of any test $\phi$ is

$$R(\theta, \phi) = \int p_\theta(x)\{\phi(x)L_1(\theta) + [1 - \phi(x)]L_0(\theta)\}\, d\mu(x) = \int p_\theta(x)\{L_0(\theta) + [L_1(\theta) - L_0(\theta)]\phi(x)\}\, d\mu(x),$$

and hence the difference of two risk functions is

$$R(\theta, \phi') - R(\theta, \phi) = [L_1(\theta) - L_0(\theta)]\int (\phi' - \phi)\,p_\theta\, d\mu.$$

This is $\le 0$ for all $\theta$ if

$$\beta_{\phi'}(\theta) - \beta_\phi(\theta) = \int (\phi' - \phi)\,p_\theta\, d\mu \gtreqless 0 \quad \text{for } \theta \gtreqless \theta_0.$$

Given any test $\phi$, let $E_{\theta_0}\phi(X) = \alpha$. It follows from Theorem 3.4.1(i) that there
exists a UMP level-$\alpha$ test $\phi'$ for testing $\theta = \theta_0$ against $\theta > \theta_0$, which satisfies
(3.16) and (3.17). By Theorem 3.4.1(iv), $\phi'$ also minimizes the power for $\theta < \theta_0$.
Thus the two risk functions satisfy $R(\theta, \phi') \le R(\theta, \phi)$ for all $\theta$, as was to be
proved.

(ii): Let $\phi_\alpha$ and $\phi_{\alpha'}$ be of sizes $\alpha < \alpha'$ and UMP for testing $\theta_0$ against $\theta > \theta_0$.
Then $\beta_{\phi_\alpha}(\theta) < \beta_{\phi_{\alpha'}}(\theta)$ for all $\theta > \theta_0$ unless $\beta_{\phi_\alpha}(\theta) = 1$. By considering the
problem of testing $\theta = \theta_0$ against $\theta < \theta_0$ it is seen analogously that this inequality
also holds for all $\theta < \theta_0$ unless $\beta_{\phi_{\alpha'}}(\theta) = 0$. Since the exceptional possibilities are
excluded by the assumptions, it follows that $R(\theta, \phi_{\alpha'}) \lessgtr R(\theta, \phi_\alpha)$ as $\theta \gtrless \theta_0$. Hence
each of the two risk functions is better than the other for some values of $\theta$.

The class of tests previously derived as UMP at the various significance levels
$\alpha$ is now seen to constitute an essentially complete class for a much more general
decision problem, in which the loss function is only required to satisfy certain
broad qualitative conditions. From this point of view, the formulation involving
the specification of a level of significance can be considered a simple way of
selecting a particular procedure from an essentially complete family.

The property of monotone likelihood ratio defines a very strong ordering of a
family of distributions. For later use, we consider also the following somewhat
weaker definition. A family of cumulative distribution functions $F_\theta$ on the real line
is said to be stochastically increasing (and the same term is applied to random
variables possessing these distributions) if the distributions are distinct and if
$\theta < \theta'$ implies $F_\theta(x) \ge F_{\theta'}(x)$ for all $x$. If then $X$ and $X'$ have distributions $F_\theta$
and $F_{\theta'}$ respectively, it follows that $P\{X > x\} \le P\{X' > x\}$ for all $x$, so that
$X'$ tends to have larger values than $X$. In this case the variable $X'$ is said to
be stochastically larger than $X$. This relationship is made more intuitive by the
following characterization of the stochastic ordering of two distributions. ■

Lemma 3.4.1 Let $F_0$ and $F_1$ be two cumulative distribution functions on the real
line. Then $F_1(x) \le F_0(x)$ for all $x$ if and only if there exist two nondecreasing
functions $f_0$ and $f_1$, and a random variable $V$, such that (a) $f_0(v) \le f_1(v)$ for
all $v$, and (b) the distributions of $f_0(V)$ and $f_1(V)$ are $F_0$ and $F_1$ respectively.

Proof. Suppose first that the required $f_0$, $f_1$ and $V$ exist. Then

$$F_1(x) = P\{f_1(V) \le x\} \le P\{f_0(V) \le x\} = F_0(x)$$

for all $x$. Conversely, suppose that $F_1(x) \le F_0(x)$ for all $x$, and let
$f_i(y) = \inf\{x : F_i(x - 0) \le y \le F_i(x)\}$, $i = 0, 1$. These functions are nondecreasing and for
$f_i = f$, $F_i = F$ satisfy

$$f[F(x)] \le x \quad \text{and} \quad F[f(y)] \ge y \quad \text{for all } x \text{ and } y.$$

It follows that $y \le F(x_0)$ implies $f(y) \le f[F(x_0)] \le x_0$ and that conversely
$f(y) \le x_0$ implies $F[f(y)] \le F(x_0)$ and hence $y \le F(x_0)$, so that the two
inequalities $f(y) \le x_0$ and $y \le F(x_0)$ are equivalent. Let $V$ be uniformly distributed
on (0,1). Then $P\{f_i(V) \le x\} = P\{V \le F_i(x)\} = F_i(x)$. Since $F_1(x) \le F_0(x)$ for
all $x$ implies $f_0(y) \le f_1(y)$ for all $y$, this completes the proof. ■
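The construction in the proof of Lemma 3.4.1 can be tried on a small discrete example. In the sketch below, $f_i$ is read as the usual generalized inverse (the smallest $x$ with $F_i(x) \ge y$), which is an assumption about how to implement the definition on a finite support; the two probability functions and the grid standing in for $V$ are likewise illustrative.

```python
from fractions import Fraction as Fr

def cdf(pmf):
    """List of (x, F(x)) pairs for a pmf given as {value: probability} on a finite support."""
    total, out = Fr(0), []
    for x in sorted(pmf):
        total += pmf[x]
        out.append((x, total))
    return out

def quantile(pmf, y):
    """f(y) = smallest support point x with F(x) >= y, for 0 < y <= 1."""
    for x, Fx in cdf(pmf):
        if Fx >= y:
            return x
    raise ValueError("y must lie in (0, 1]")

pmf0 = {0: Fr(1, 2), 1: Fr(3, 10), 2: Fr(1, 5)}   # distribution F_0
pmf1 = {0: Fr(1, 5), 1: Fr(3, 10), 2: Fr(1, 2)}   # distribution F_1, with F_1 <= F_0 pointwise

grid = [Fr(k, 100) for k in range(1, 100)]         # grid of values v in (0, 1)
coupled = [(quantile(pmf0, v), quantile(pmf1, v)) for v in grid]
```

Every pair in `coupled` satisfies $f_0(v) \le f_1(v)$, so the two variables $f_0(V)$ and $f_1(V)$ are ordered pointwise and not merely in distribution.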

One of the simplest examples of a stochastically ordered family is a location
parameter family, that is, a family satisfying

$$F_\theta(x) = F(x - \theta).$$

To see that this is stochastically increasing, let $X$ be a random variable with
distribution $F(x)$. Then $\theta < \theta'$ implies

$$F(x - \theta) = P\{X \le x - \theta\} \ge P\{X \le x - \theta'\} = F(x - \theta'),$$

as was to be shown.

Another example is furnished by families with monotone likelihood ratio. This is
seen from the following lemma, which establishes some basic properties of these
families.

Lemma 3.4.2 Let $p_\theta(x)$ be a family of densities on the real line with monotone
likelihood ratio in $x$.

(i) If $\psi$ is a nondecreasing function of $x$, then $E_\theta\psi(X)$ is a nondecreasing
function of $\theta$; if $X_1, \dots, X_n$ are independently distributed with density $p_\theta$ and $\psi'$
is a function of $x_1, \dots, x_n$ which is nondecreasing in each of its arguments, then
$E_\theta\psi'(X_1, \dots, X_n)$ is a nondecreasing function of $\theta$.

(ii) For any $\theta < \theta'$, the cumulative distribution functions of $X$ under $\theta$ and $\theta'$
satisfy

$$F_{\theta'}(x) \le F_\theta(x) \quad \text{for all } x.$$

(iii) Let $\psi$ be a function with a single change of sign. More specifically, suppose
there exists a value $x_0$ such that $\psi(x) \le 0$ for $x < x_0$ and $\psi(x) \ge 0$ for $x \ge x_0$.
Then there exists $\theta_0$ such that $E_\theta\psi(X) \le 0$ for $\theta < \theta_0$ and $E_\theta\psi(X) \ge 0$ for
$\theta > \theta_0$, unless $E_\theta\psi(X)$ is either positive for all $\theta$ or negative for all $\theta$.

(iv) Suppose that $p_\theta(x)$ is positive for all $\theta$ and all $x$, that $p_{\theta'}(x)/p_\theta(x)$ is
strictly increasing in $x$ for $\theta < \theta'$, and that $\psi(x)$ is as in (iii) and is $\neq 0$ with
positive probability. If $E_{\theta_0}\psi(X) = 0$, then $E_\theta\psi(X) < 0$ for $\theta < \theta_0$ and $> 0$ for
$\theta > \theta_0$.


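Proof. (i): Let $\theta < \theta'$, and let $A$ and $B$ be the sets for which $p_{\theta'}(x) < p_\theta(x)$ and
$p_{\theta'}(x) > p_\theta(x)$ respectively. If $a = \sup_A \psi(x)$ and $b = \inf_B \psi(x)$, then $b - a \ge 0$
and

$$\int \psi\,(p_{\theta'} - p_\theta)\, d\mu \ge a\int_A (p_{\theta'} - p_\theta)\, d\mu + b\int_B (p_{\theta'} - p_\theta)\, d\mu = (b - a)\int_B (p_{\theta'} - p_\theta)\, d\mu \ge 0,$$

which proves the first assertion. The result for general $n$ follows by induction.

(ii): This follows from (i) by letting $\psi(x) = 1$ for $x > x_0$ and $\psi(x) = 0$
otherwise.

(iii): We shall show first that for any $\theta' < \theta''$, $E_{\theta'}\psi(X) > 0$ implies $E_{\theta''}\psi(X) \ge 0$.
If $p_{\theta''}(x_0)/p_{\theta'}(x_0) = \infty$, then $p_{\theta'}(x) = 0$ for $x \ge x_0$ and hence $E_{\theta'}\psi(X) \le 0$.
Suppose therefore that $p_{\theta''}(x_0)/p_{\theta'}(x_0) = c < \infty$. Then $\psi(x) \ge 0$ on the set
$S = \{x : p_{\theta'}(x) = 0 \text{ and } p_{\theta''}(x) > 0\}$, and, with $\tilde S$ denoting the complement of $S$,

$$E_{\theta''}\psi(X) \ge \int_{\tilde S} \psi\,\frac{p_{\theta''}}{p_{\theta'}}\,p_{\theta'}\, d\mu \ge \int_{-\infty}^{x_0-} c\,\psi\,p_{\theta'}\, d\mu + \int_{x_0}^{\infty} c\,\psi\,p_{\theta'}\, d\mu = c\,E_{\theta'}\psi(X) > 0.$$

The result now follows by letting $\theta_0 = \inf\{\theta : E_\theta\psi(X) > 0\}$.

(iv): The proof is analogous to that of (iii). ■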

Part (ii) of the lemma shows that any family of distributions with monotone
likelihood ratio in $x$ is stochastically increasing. That the converse does not hold
is shown for example by the Cauchy densities

$$\frac{1}{\pi}\,\frac{1}{1 + (x - \theta)^2}.$$

The family is stochastically increasing, since $\theta$ is a location parameter; however,
the likelihood ratio is not monotone. Conditions under which a location parameter
family possesses monotone likelihood ratio are given in Example 8.2.1.
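That the Cauchy likelihood ratio is not monotone can be seen from a few values. The sketch below evaluates $p_{\theta'}(x)/p_\theta(x) = [1 + (x - \theta)^2]/[1 + (x - \theta')^2]$ exactly for the illustrative choice $\theta = 0$, $\theta' = 1$; the function name and the evaluation points are assumptions.

```python
from fractions import Fraction

def cauchy_ratio(x, theta0, theta1):
    """p_{theta1}(x) / p_{theta0}(x) for Cauchy location densities (the 1/pi factors cancel)."""
    return Fraction(1 + (x - theta0) ** 2, 1 + (x - theta1) ** 2)

xs = [-2, 0, 1, 2, 3, 10]                       # integer evaluation points
ratios = [cauchy_ratio(x, 0, 1) for x in xs]    # 1/2, 1/2, 2, 5/2, 2, 101/82
```

The ratio rises to 5/2 at $x = 2$ and then falls back toward 1, so it is not a nondecreasing function of $x$.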

Lemma 3.4.2 is a special case of a theorem of Karlin (1957, 1968) relating the
number of sign changes of $E_\theta\psi(X)$ to those of $\psi(x)$ when the densities $p_\theta(x)$ are
totally positive (defined in Problem 3.50). The application of totally positive
(or equivalently, variation diminishing) distributions to statistics is discussed by
Brown, Johnstone, and MacGibbon (1981); see also Problem 3.53.


3.5 Confidence Bounds 

The theory of UMP one-sided tests can be applied to the problem of obtaining
a lower or upper bound for a real-valued parameter $\theta$. The problem of setting a
lower bound arises, for example, when $\theta$ is the breaking strength of a new alloy;
that of setting an upper bound, when $\theta$ is the toxicity of a drug or the probability
of an undesirable event. The discussion of lower and upper bounds is completely
parallel, and it is therefore enough to consider the case of a lower bound, say $\underline\theta$.

Since $\underline\theta = \underline\theta(X)$ will be a function of the observations, it cannot be required to
fall below $\theta$ with certainty, but only with specified high probability. One selects a
number $1 - \alpha$, the confidence level, and restricts attention to bounds $\underline\theta$ satisfying

$$P_\theta\{\underline\theta(X) \le \theta\} \ge 1 - \alpha \quad \text{for all } \theta. \tag{3.21}$$

The function $\underline\theta$ is called a lower confidence bound for $\theta$ at confidence level $1 - \alpha$;
the infimum of the left-hand side of (3.21), which in practice will be equal to
$1 - \alpha$, is called the confidence coefficient of $\underline\theta$.

Subject to (3.21), $\underline\theta$ should underestimate $\theta$ by as little as possible. One can
ask, for example, that the probability of $\underline\theta$ falling below any $\theta' < \theta$ should be a
minimum. A function $\underline\theta$ for which

$$P_\theta\{\underline\theta(X) \le \theta'\} = \text{minimum} \tag{3.22}$$

for all $\theta' < \theta$ subject to (3.21) is a uniformly most accurate lower confidence
bound for $\theta$ at confidence level $1 - \alpha$.

Let $L(\theta, \underline\theta)$ be a measure of the loss resulting from underestimating $\theta$, so that
for each fixed $\theta$ the function $L(\theta, \underline\theta)$ is defined and nonnegative for $\underline\theta < \theta$, and is
nonincreasing in this second argument. One would then wish to minimize

$$E_\theta L(\theta, \underline\theta) \tag{3.23}$$

subject to (3.21). It can be shown that a uniformly most accurate lower confidence
bound $\underline\theta$ minimizes (3.23) subject to (3.21) for every such loss function $L$. (See
Problem 3.44.)

The derivation of uniformly most accurate confidence bounds is facilitated by
introducing the following more general concept, which will be considered in more
detail in Chapter 5. A family of subsets $S(x)$ of the parameter space $\Omega$ is said to
constitute a family of confidence sets at confidence level $1 - \alpha$ if

$$P_\theta\{\theta \in S(X)\} \ge 1 - \alpha \quad \text{for all } \theta \in \Omega, \tag{3.24}$$

that is, if the random set $S(X)$ covers the true parameter point with probability
$\ge 1 - \alpha$. A lower confidence bound corresponds to the special case that $S(x)$ is
a one-sided interval

$$S(x) = \{\theta : \underline\theta(x) \le \theta < \infty\}.$$

Theorem 3.5.1 (i) For each $\theta_0 \in \Omega$ let $A(\theta_0)$ be the acceptance region of a
level-$\alpha$ test for testing $H(\theta_0) : \theta = \theta_0$, and for each sample point $x$ let $S(x)$
denote the set of parameter values

$$S(x) = \{\theta : x \in A(\theta),\ \theta \in \Omega\}.$$

Then $S(x)$ is a family of confidence sets for $\theta$ at confidence level $1 - \alpha$.

(ii) If for all $\theta_0$, $A(\theta_0)$ is UMP for testing $H(\theta_0)$ at level $\alpha$ against the
alternatives $K(\theta_0)$, then for each $\theta_0 \in \Omega$, $S(X)$ minimizes the probability

$$P_\theta\{\theta_0 \in S(X)\} \quad \text{for all } \theta \in K(\theta_0)$$

among all level $1 - \alpha$ families of confidence sets for $\theta$.

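Proof. (i): By definition of $S(x)$,

$$\theta \in S(x) \quad \text{if and only if} \quad x \in A(\theta), \tag{3.25}$$

and hence

$$P_\theta\{\theta \in S(X)\} = P_\theta\{X \in A(\theta)\} \ge 1 - \alpha.$$

(ii): If $S^*(x)$ is any other family of confidence sets at level $1 - \alpha$, and if
$A^*(\theta) = \{x : \theta \in S^*(x)\}$, then

$$P_\theta\{X \in A^*(\theta)\} = P_\theta\{\theta \in S^*(X)\} \ge 1 - \alpha,$$

so that $A^*(\theta_0)$ is the acceptance region of a level-$\alpha$ test of $H(\theta_0)$. It follows from
the assumed property of $A(\theta_0)$ that for any $\theta \in K(\theta_0)$

$$P_\theta\{X \in A^*(\theta_0)\} \ge P_\theta\{X \in A(\theta_0)\}$$

and hence that

$$P_\theta\{\theta_0 \in S^*(X)\} \ge P_\theta\{\theta_0 \in S(X)\},$$

as was to be proved. ■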

The equivalence (3.25) shows the structure of the confidence sets $S(x)$ as the
totality of parameter values $\theta$ for which the hypothesis $H(\theta)$ is accepted when $x$
is observed. A confidence set can therefore be viewed as a combined statement
regarding the tests of the various hypotheses $H(\theta)$, which exhibits the values for
which the hypothesis is accepted [$\theta \in S(x)$] and those for which it is rejected
[$\theta \notin S(x)$].

Corollary 3.5.1 Let the family of densities $p_\theta(x)$, $\theta \in \Omega$, have monotone
likelihood ratio in $T(x)$, and suppose that the cumulative distribution function $F_\theta(t)$
of $T = T(X)$ is a continuous function in each of the variables $t$ and $\theta$ when the
other is fixed.

(i) There exists a uniformly most accurate confidence bound $\underline\theta$ for $\theta$ at each
confidence level $1 - \alpha$.

(ii) If $x$ denotes the observed values of $X$ and $t = T(x)$, and if the equation

$$F_\theta(t) = 1 - \alpha \tag{3.26}$$

has a solution $\theta = \hat\theta$ in $\Omega$ then this solution is unique and $\underline\theta(x) = \hat\theta$.

Proof. (i): There exists for each $\theta_0$ a constant $C(\theta_0)$ such that

$$P_{\theta_0}\{T > C(\theta_0)\} = \alpha,$$

and by Theorem 3.4.1, $T > C(\theta_0)$ is a UMP level-$\alpha$ rejection region for testing
$\theta = \theta_0$ against $\theta > \theta_0$. By Corollary 3.2.1, the power of this test against any
alternative $\theta_1 > \theta_0$ exceeds $\alpha$, and hence $C(\theta_0) < C(\theta_1)$ so that the function $C$ is
strictly increasing; it is also continuous. Let $A(\theta_0)$ denote the acceptance region
$T \le C(\theta_0)$, and let $S(x)$ be defined by (3.25). It follows from the monotonicity
of the function $C$ that $S(x)$ consists of those values $\theta \in \Omega$ which satisfy $\underline\theta \le \theta$,
where

$$\underline\theta = \inf\{\theta : T(x) \le C(\theta)\}.$$

By Theorem 3.5.1, the sets $\{\theta : \underline\theta(x) \le \theta\}$, restricted to possible values of the
parameter, constitute a family of confidence sets at level $1 - \alpha$, which minimize
$P_\theta\{\underline\theta \le \theta'\}$ for all $\theta \in K(\theta')$, that is, for all $\theta > \theta'$. This shows $\underline\theta$ to be a uniformly
most accurate confidence bound for $\theta$.

(ii): It follows from Corollary 3.2.1 that $F_\theta(t)$ is a strictly decreasing function
of $\theta$ at any point $t$ for which $0 < F_\theta(t) < 1$, and hence that (3.26) can have at
most one solution. Suppose now that $t$ is the observed value of $T$ and that the
equation $F_\theta(t) = 1 - \alpha$ has the solution $\hat\theta \in \Omega$. Then $F_{\hat\theta}(t) = 1 - \alpha$, and by
definition of the function $C$, $C(\hat\theta) = t$. The inequality $t \le C(\theta)$ is then equivalent
to $C(\hat\theta) \le C(\theta)$ and hence to $\hat\theta \le \theta$. It follows that $\underline\theta = \hat\theta$, as was to be proved.

Under the same assumptions, the corresponding upper confidence bound with
confidence coefficient $1 - \alpha$ is the solution $\bar\theta$ of the equation $P_\theta\{T \ge t\} = 1 - \alpha$
or equivalently of $F_\theta(t) = \alpha$. ■

Example 3.5.1 (Exponential waiting times) To determine an upper bound
for the degree of radioactivity $\lambda$ of a radioactive substance, the substance is
observed until a count of $m$ has been obtained on a Geiger counter. Under the
assumptions of Example 3.4.3, the joint probability density of the times $T_i$
($i = 1, \ldots, m$) elapsing between the $(i-1)$st count and the $i$th one is

$$p(t_1, \ldots, t_m) = \lambda^m e^{-\lambda \sum t_i}, \qquad t_1, \ldots, t_m \ge 0.$$

If $T = \sum T_i$ denotes the total time of observation, then $2\lambda T$ has a $\chi^2$-distribution
with $2m$ degrees of freedom, and, as was shown in Example 3.4.3, the acceptance
region of the most powerful test of $H(\lambda_0) : \lambda = \lambda_0$ against $\lambda < \lambda_0$ is $2\lambda_0 T \le C$,
where $C$ is determined by the equation

$$\int_0^C \chi^2_{2m} = 1 - \alpha.$$

The set defined by (3.25) is then the set of values $\lambda$ such that $\lambda \le C/2T$, and it
follows from Theorem 3.5.1 that $\bar{\lambda} = C/2T$ is a uniformly most accurate upper
confidence bound for $\lambda$. This result can also be obtained through
Corollary 3.5.1. ■
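The bound of Example 3.5.1 can be evaluated numerically. The following is a minimal sketch, not part of the original text: it assumes the closed form of the $\chi^2$ distribution function for an even number $2m$ of degrees of freedom and finds $C$ by bisection. The function names (`chi2_cdf_even`, `chi2_quantile_even`, `upper_bound_lambda`) are hypothetical.

```python
import math

def chi2_cdf_even(c, m):
    """P(chi-square with 2m degrees of freedom <= c), closed form for even d.f."""
    half = c / 2.0
    term, total = 1.0, 1.0            # k = 0 term of sum_{k < m} half^k / k!
    for k in range(1, m):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(prob, m, tol=1e-12):
    """Solve chi2_cdf_even(c, m) = prob for c by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, m) < prob:    # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, m) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def upper_bound_lambda(times, alpha):
    """Upper confidence bound C / (2T) for lambda at confidence level 1 - alpha."""
    m = len(times)                        # number of counts observed
    total_time = sum(times)               # T = sum of the waiting times
    c = chi2_quantile_even(1.0 - alpha, m)
    return c / (2.0 * total_time)
```

For $m = 1$ the equation for $C$ reduces to $1 - e^{-C/2} = 1 - \alpha$, so the bound is $-\log\alpha / T$, which gives a quick check of the sketch.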

If the variables $X$ or $T$ are discrete, Corollary 3.5.1 cannot be applied directly,
since the distribution functions $F_\theta(t)$ are not continuous, and for most values $\theta_0$
the optimum tests of $H : \theta = \theta_0$ are randomized. However, any randomized test
based on $X$ has the following representation as a nonrandomized test depending
on $X$ and an independent variable $U$ distributed uniformly over $(0, 1)$. Given a
critical function $\phi$, consider the rejection region

$$R = \{(x, u) : u \le \phi(x)\}.$$

Then

$$P\{(X, U) \in R\} = P\{U \le \phi(X)\} = E\phi(X),$$

whatever the distribution of $X$, so that $R$ has the same power function as $\phi$ and
the two tests are equivalent. The pair of variables $(X, U)$ has a particularly simple
representation when $X$ is integer-valued. In this case the statistic

$$T = X + U$$

is equivalent to the pair $(X, U)$, since with probability 1

$$X = [T], \qquad U = T - [T],$$

where $[T]$ denotes the largest integer $\le T$. The distribution of $T$ is continuous,
and confidence bounds can be based on this statistic.
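The representation above can be illustrated with a small simulation. This is only a sketch under assumed inputs: the critical function `phi`, the binomial model with $n = 4$ and $p = 1/2$, and all function names are hypothetical choices made for illustration.

```python
import math
import random

def phi(x):
    """Hypothetical critical function: reject with probability phi(x)."""
    if x >= 3:
        return 1.0
    if x == 2:
        return 0.4
    return 0.0

def rejects(x, u):
    """Nonrandomized test on the pair (x, u): reject iff u <= phi(x)."""
    return u <= phi(x)

def split(t):
    """Recover (x, u) from t = x + u when x is an integer and 0 <= u < 1."""
    x = math.floor(t)
    return x, t - x

def exact_size(n=4, p=0.5):
    """E phi(X) for X ~ b(p, n), computed exactly."""
    return sum(phi(x) * math.comb(n, x) * p**x * (1.0 - p)**(n - x)
               for x in range(n + 1))

def simulated_size(n=4, p=0.5, draws=200_000, seed=0):
    """Monte Carlo estimate of P{(X, U) in R} with U uniform and independent of X."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        x = sum(rng.random() < p for _ in range(n))
        hits += rejects(x, rng.random())
    return hits / draws
```

Up to simulation error, `simulated_size()` should agree with `exact_size()`, reflecting the identity $P\{U \le \phi(X)\} = E\phi(X)$.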


Example 3.5.2 (Binomial) An upper bound is required for a binomial probability
$p$, for example the probability that a batch of polio vaccine manufactured
according to a certain procedure contains any live virus. Let $X_1, \ldots, X_n$ denote
the outcomes of $n$ trials, $X_i$ being 1 or 0 with probabilities $p$ and $q$ respectively,
and let $X = \sum X_i$. Then $T = X + U$ has probability density

$$\binom{n}{[t]} p^{[t]} q^{\,n-[t]}, \qquad 0 \le t < n + 1.$$

This satisfies the conditions of Corollary 3.5.1, and the upper confidence bound
$\bar{p}$ is therefore the solution, if it exists, of the equation

$$P_p\{T < t\} = \alpha,$$

where $t$ is the observed value of $T$. A solution does exist for all values
$\alpha \le t \le n + \alpha$. For $n + \alpha < t$, the hypothesis $H(p_0) : p = p_0$ is accepted against
the alternative $p < p_0$ for all values of $p_0$ and hence $\bar{p} = 1$. For $t < \alpha$, $H(p_0)$
is rejected for all values of $p_0$ and the confidence set $S(t)$ is therefore empty.
Consider instead the sets $S^*(t)$ which are equal to $S(t)$ for $t \ge \alpha$ and which for
$t < \alpha$ consist of the single point $p = 0$. They are also confidence sets at level
$1 - \alpha$, since for all $p$,

$$P_p\{p \in S^*(T)\} \ge P_p\{p \in S(T)\} = 1 - \alpha.$$

On the other hand, $P_p\{p' \in S^*(T)\} = P_p\{p' \in S(T)\}$ for all $p' > 0$ and hence

$$P_p\{p' \in S^*(T)\} = P_p\{p' \in S(T)\} \quad \text{for all } p' > p.$$

Thus the family of sets $S^*(t)$ minimizes the probability of covering $p'$ for all $p' > p$
at confidence level $1 - \alpha$. The associated confidence bound $\bar{p}^*(t) = \bar{p}(t)$ for $t \ge \alpha$
and $\bar{p}^*(t) = 0$ for $t < \alpha$ is therefore a uniformly most accurate upper confidence
bound for $p$ at level $1 - \alpha$.

In practice, so as to avoid randomization and obtain a bound not dependent on
the extraneous variable $U$, one usually replaces $T$ by $X + 1 = [T] + 1$. Since $\bar{p}^*(t)$
is a nondecreasing function of $t$, the resulting upper confidence bound
$\bar{p}^*([t] + 1)$ is then somewhat larger than necessary; as a compensation it also gives a
correspondingly higher probability of not falling below the true $p$.
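Both the randomized bound and its nonrandomized replacement can be computed by solving $P_p\{T < t\} = \alpha$ numerically. The sketch below is an illustration only; it assumes bisection on $p$ (using the fact that $P_p\{T < t\}$ is decreasing in $p$), and the function names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def cdf_T(t, n, p):
    """P_p{T < t} for T = X + U, X ~ b(p, n), U uniform on (0, 1), 0 <= t < n + 1."""
    k = math.floor(t)
    below = sum(binom_pmf(j, n, p) for j in range(k))   # P{X < [t]}
    return below + (t - k) * binom_pmf(k, n, p)         # plus P{X = [t], U < t - [t]}

def upper_bound_p(t, n, alpha, iters=200):
    """Upper confidence bound for p based on the observed value t of T."""
    if t < alpha:
        return 0.0            # S*(t) consists of the single point p = 0
    if t > n + alpha:
        return 1.0            # H(p0) is accepted for every p0
    lo, hi = 0.0, 1.0
    for _ in range(iters):    # cdf_T(t, n, p) is decreasing in p
        mid = 0.5 * (lo + hi)
        if cdf_T(t, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def upper_bound_p_nonrandomized(x, n, alpha):
    """Replace T by X + 1, giving a bound that does not depend on U."""
    return upper_bound_p(x + 1, n, alpha)
```

With $x = 0$ the nonrandomized equation is $(1 - p)^n = \alpha$, so the bound is $1 - \alpha^{1/n}$, which can be used to check the sketch.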

References to tables for the confidence bounds and a careful discussion of
various approximations can be found in Hall (1982) and Blyth (1984). Large
sample approaches will be discussed in Example 11.2.7. ■

Let $\underline{\theta}$ and $\bar{\theta}$ be lower and upper bounds for $\theta$ with confidence coefficients $1 - \alpha_1$
and $1 - \alpha_2$, and suppose that $\underline{\theta}(x) < \bar{\theta}(x)$ for all $x$. This will be the case under
the assumptions of Corollary 3.5.1 if $\alpha_1 + \alpha_2 < 1$. The intervals $(\underline{\theta}, \bar{\theta})$ are then
confidence intervals for $\theta$ with confidence coefficient $1 - \alpha_1 - \alpha_2$; that is, they
contain the true parameter value with probability $1 - \alpha_1 - \alpha_2$, since

$$P_\theta\{\underline{\theta} \le \theta \le \bar{\theta}\} = 1 - \alpha_1 - \alpha_2 \quad \text{for all } \theta.$$

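If $\underline{\theta}$ and $\bar{\theta}$ are uniformly most accurate, they minimize $E_\theta L_1(\theta, \underline{\theta})$ and $E_\theta L_2(\theta, \bar{\theta})$
at their respective levels for any function $L_1$ that is nonincreasing in $\underline{\theta}$ for $\underline{\theta} < \theta$
and 0 for $\underline{\theta} \ge \theta$ and any $L_2$ that is nondecreasing in $\bar{\theta}$ for $\bar{\theta} > \theta$ and 0 for $\bar{\theta} \le \theta$.
Letting

$$L(\theta; \underline{\theta}, \bar{\theta}) = L_1(\theta, \underline{\theta}) + L_2(\theta, \bar{\theta}),$$

the intervals $(\underline{\theta}, \bar{\theta})$ therefore minimize $E_\theta L(\theta; \underline{\theta}, \bar{\theta})$ subject to

$$P_\theta\{\underline{\theta} > \theta\} \le \alpha_1, \qquad P_\theta\{\bar{\theta} < \theta\} \le \alpha_2.$$

An example of such a loss function is

$$L(\theta; \underline{\theta}, \bar{\theta}) = \begin{cases} \bar{\theta} - \underline{\theta} & \text{if } \underline{\theta} \le \theta \le \bar{\theta}, \\ \bar{\theta} - \theta & \text{if } \theta < \underline{\theta}, \\ \theta - \underline{\theta} & \text{if } \bar{\theta} < \theta, \end{cases}$$

which provides a natural measure of the accuracy of the intervals. Other possible
measures are the actual length $\bar{\theta} - \underline{\theta}$ of the intervals, or, for example,
$a(\theta - \underline{\theta})^2 + b(\bar{\theta} - \theta)^2$, which gives an indication of the distance of the two end points from
the true value.⁷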

An important limiting case corresponds to the levels $\alpha_1 = \alpha_2 = \tfrac{1}{2}$. Under the
assumptions of Corollary 3.5.1 and if the region of positive density is independent
of $\theta$ so that tests of power 1 are impossible when $\alpha < 1$, the upper and lower
confidence bounds $\bar{\theta}$ and $\underline{\theta}$ coincide in this case. The common bound satisfies

$$P_\theta\{\underline{\theta} \le \theta\} = P_\theta\{\underline{\theta} \ge \theta\} = \tfrac{1}{2},$$

and the estimate $\underline{\theta}$ of $\theta$ is therefore as likely to underestimate as to overestimate
the true value. An estimate with this property is said to be median unbiased. (For
the relation of this to other concepts of unbiasedness, see Problem 1.3.) It follows
from the above result for arbitrary $\alpha_1$ and $\alpha_2$ that among all median unbiased
estimates, $\underline{\theta}$ minimizes $EL(\theta, \underline{\theta})$ for any monotone loss function, that is, any loss
function which for fixed $\theta$ has a minimum of 0 at $\underline{\theta} = \theta$ and is nondecreasing
as $\underline{\theta}$ moves away from $\theta$ in either direction. By taking in particular $L(\theta, \underline{\theta}) = 0$
when $|\theta - \underline{\theta}| \le \Delta$ and $= 1$ otherwise, it is seen that among all median unbiased
estimates, $\underline{\theta}$ minimizes the probability of differing from $\theta$ by more than any given
amount; more generally it maximizes the probability

$$P_\theta\{-\Delta_1 \le \theta - \underline{\theta} \le \Delta_2\}$$

for any $\Delta_1, \Delta_2 \ge 0$.

A more detailed assessment of the position of $\theta$ than that provided by confidence
bounds or intervals corresponding to a fixed level $\gamma = 1 - \alpha$ is obtained by
stating confidence bounds for a number of levels, for example upper confidence
bounds corresponding to values such as $\gamma = .05, .1, .25, .5, .75, .9, .95$. These
constitute a set of standard confidence bounds,⁸ from which different specific intervals
or bounds can be obtained in the obvious manner.

⁷ Proposed by Wolfowitz (1950).


3.6 A Generalization of the Fundamental Lemma 

The following is a useful extension of Theorem 3.2.1 to the case of more than one
side condition.


Theorem 3.6.1 Let $f_1, \ldots, f_{m+1}$ be real-valued functions defined on a Euclidean
space $\mathcal{X}$ and integrable $\mu$, and suppose that for given constants $c_1, \ldots, c_m$ there
exists a critical function $\phi$ satisfying

$$\int \phi f_i \, d\mu = c_i, \qquad i = 1, \ldots, m. \tag{3.27}$$

Let $\mathcal{C}$ be the class of critical functions $\phi$ for which (3.27) holds.

(i) Among all members of $\mathcal{C}$ there exists one that maximizes

$$\int \phi f_{m+1} \, d\mu.$$

(ii) A sufficient condition for a member of $\mathcal{C}$ to maximize

$$\int \phi f_{m+1} \, d\mu$$

is the existence of constants $k_1, \ldots, k_m$ such that

$$\begin{aligned} \phi(x) &= 1 \quad \text{when} \quad f_{m+1}(x) > \sum_{i=1}^{m} k_i f_i(x), \\ \phi(x) &= 0 \quad \text{when} \quad f_{m+1}(x) < \sum_{i=1}^{m} k_i f_i(x). \end{aligned} \tag{3.28}$$

(iii) If a member of $\mathcal{C}$ satisfies (3.28) with $k_1, \ldots, k_m \ge 0$, then it maximizes

$$\int \phi f_{m+1} \, d\mu$$

among all critical functions satisfying

$$\int \phi f_i \, d\mu \le c_i, \qquad i = 1, \ldots, m. \tag{3.29}$$

(iv) The set $M$ of points in $m$-dimensional space whose coordinates are

$$\left( \int \phi f_1 \, d\mu, \ldots, \int \phi f_m \, d\mu \right)$$

for some critical function $\phi$ is convex and closed. If $(c_1, \ldots, c_m)$ is an inner
point⁹ of $M$, then there exist constants $k_1, \ldots, k_m$ and a test $\phi$ satisfying (3.27)
and (3.28), and a necessary condition for a member of $\mathcal{C}$ to maximize

$$\int \phi f_{m+1} \, d\mu$$

is that (3.28) holds a.e. $\mu$.

⁸ Suggested by Tukey (1949b).


Here the term “inner point of $M$” in statement (iv) can be interpreted as
meaning a point interior to $M$ relative to $m$-space or relative to the smallest
linear space (of dimension $\le m$) containing $M$. The theorem is correct with both
interpretations but is stronger with respect to the latter, for which it will be
proved.

We also note that exactly analogous results hold for the minimization of

$$\int \phi f_{m+1} \, d\mu.$$

Proof. (i): Let $\{\phi_n\}$ be a sequence of functions in $\mathcal{C}$ such that $\int \phi_n f_{m+1} \, d\mu$
tends to $\sup_\phi \int \phi f_{m+1} \, d\mu$. By the weak compactness theorem for critical functions
(Theorem A.5.1 of the Appendix), there exists a subsequence $\{\phi_{n_i}\}$ and a critical
function $\phi$ such that

$$\int \phi_{n_i} f_k \, d\mu \to \int \phi f_k \, d\mu \quad \text{for } k = 1, \ldots, m + 1.$$

It follows that $\phi$ is in $\mathcal{C}$ and maximizes the integral with respect to $f_{m+1} \, d\mu$
within $\mathcal{C}$.

(ii) and (iii) are proved exactly as was part (ii) of Theorem 3.2.1.

(iv): That $M$ is closed follows again from the weak compactness theorem, and
its convexity is a consequence of the fact that if $\phi_1$ and $\phi_2$ are critical functions,
so is $\alpha\phi_1 + (1 - \alpha)\phi_2$ for any $0 \le \alpha \le 1$. If $N$ (see Figure 3.2) is the totality of
points in $(m + 1)$-dimensional space with coordinates

$$\left( \int \phi f_1 \, d\mu, \ldots, \int \phi f_{m+1} \, d\mu \right),$$

where $\phi$ ranges over the class of all critical functions, then $N$ is convex and closed
by the same argument. Denote the coordinates of a general point in $M$ and $N$
by $(u_1, \ldots, u_m)$ and $(u_1, \ldots, u_{m+1})$ respectively. The points of $N$, the first $m$
coordinates of which are $c_1, \ldots, c_m$, form a closed interval $[c^*, c^{**}]$.

Assume first that $c^* < c^{**}$. Since $(c_1, \ldots, c_m, c^{**})$ is a boundary point of $N$,
there exists a hyperplane $\Pi$ through it such that every point of $N$ lies below or
on $\Pi$. Let the equation of $\Pi$ be

$$\sum_{i=1}^{m+1} k_i u_i = \sum_{i=1}^{m} k_i c_i + k_{m+1} c^{**}.$$

Since $(c_1, \ldots, c_m)$ is an inner point of $M$, the coefficient $k_{m+1} \ne 0$. To see this,
let $c^* < c < c^{**}$, so that $(c_1, \ldots, c_m, c)$ is an inner point of $N$. Then there exists a
sphere with this point as center lying entirely in $N$ and hence below $\Pi$. It follows
that the point $(c_1, \ldots, c_m, c)$ does not lie on $\Pi$ and hence that $k_{m+1} \ne 0$. We may
therefore take $k_{m+1} = -1$ and see that for any point of $N$

$$u_{m+1} - \sum_{i=1}^{m} k_i u_i \le c^{**} - \sum_{i=1}^{m} k_i c_i.$$

Figure 3.2.

That is, all critical functions $\phi$ satisfy

$$\int \phi \left( f_{m+1} - \sum_{i=1}^{m} k_i f_i \right) d\mu \le \int \phi^{**} \left( f_{m+1} - \sum_{i=1}^{m} k_i f_i \right) d\mu,$$

where $\phi^{**}$ is the test giving rise to the point $(c_1, \ldots, c_m, c^{**})$. Thus $\phi^{**}$ is the
critical function that maximizes the left-hand side of this inequality. Since the
integral in question is maximized by putting $\phi$ equal to 1 when the integrand is
positive and equal to 0 when it is negative, $\phi^{**}$ satisfies (3.28) a.e. $\mu$.

⁹ A discussion of the problem when this assumption is not satisfied is given by Dantzig
and Wald (1951).

If $c^* = c^{**}$, let $(c_1', \ldots, c_m')$ be any point of $M$ other than $(c_1, \ldots, c_m)$. We shall
show now that there exists exactly one real number $c'$ such that $(c_1', \ldots, c_m', c')$ is
in $N$. Suppose to the contrary that $(c_1', \ldots, c_m', \underline{c}')$ and $(c_1', \ldots, c_m', \bar{c}')$ are both in
$N$, and consider any point $(c_1'', \ldots, c_m'', c'')$ of $N$ such that $(c_1, \ldots, c_m)$ is an interior
point of the line segment joining $(c_1', \ldots, c_m')$ and $(c_1'', \ldots, c_m'')$. Such a point
exists since $(c_1, \ldots, c_m)$ is an inner point of $M$. Then the convex set spanned by
the three points $(c_1', \ldots, c_m', \underline{c}')$, $(c_1', \ldots, c_m', \bar{c}')$, and $(c_1'', \ldots, c_m'', c'')$ is contained
in $N$ and contains points $(c_1, \ldots, c_m, \underline{c})$ and $(c_1, \ldots, c_m, \bar{c})$ with $\underline{c} < \bar{c}$, which is a
contradiction. Since $N$ is convex, contains the origin, and has at most one point
on any vertical line $u_1 = c_1', \ldots, u_m = c_m'$, it is contained in a hyperplane,
which passes through the origin and is not parallel to the $u_{m+1}$-axis. It follows
that

$$\int \phi f_{m+1} \, d\mu = \sum_{i=1}^{m} k_i \int \phi f_i \, d\mu$$

for all $\phi$. This arises of course only in the trivial case that

$$f_{m+1} = \sum_{i=1}^{m} k_i f_i \quad \text{a.e. } \mu,$$

and (3.28) is satisfied vacuously. ■

Corollary 3.6.1 Let $p_1, \ldots, p_m, p_{m+1}$ be probability densities with respect to a
measure $\mu$, and let $0 < \alpha < 1$. Then there exists a test $\phi$ such that $E_i\phi(X) = \alpha$
($i = 1, \ldots, m$) and $E_{m+1}\phi(X) > \alpha$, unless $p_{m+1} = \sum_{i=1}^{m} k_i p_i$, a.e. $\mu$.

Proof. The proof will be by induction over $m$. For $m = 1$ the result reduces to
Corollary 3.2.1. Assume now that it has been proved for any set of $m$ distributions,
and consider the case of $m + 1$ densities $p_1, \ldots, p_{m+1}$. If $p_1, \ldots, p_m$ are linearly
dependent, the number of $p_i$ can be reduced and the result follows from the
induction hypothesis. Assume therefore that $p_1, \ldots, p_m$ are linearly independent.
Then for each $j = 1, \ldots, m$ there exist by the induction hypothesis tests $\phi_j$ and
$\phi_j'$ such that $E_i\phi_j(X) = E_i\phi_j'(X) = \alpha$ for all $i = 1, \ldots, j - 1, j + 1, \ldots, m$ and
$E_j\phi_j(X) < \alpha < E_j\phi_j'(X)$. It follows that the point of $m$-space for which all $m$
coordinates are equal to $\alpha$ is an inner point of $M$, so that Theorem 3.6.1(iv) is
applicable. The test $\phi(x) \equiv \alpha$ is such that $E_i\phi(X) = \alpha$ for $i = 1, \ldots, m$. If among
all tests satisfying the side conditions this one is most powerful, it has to satisfy
(3.28). Since $0 < \alpha < 1$, this implies

$$p_{m+1} = \sum_{i=1}^{m} k_i p_i \quad \text{a.e. } \mu,$$

as was to be proved. ■

The most useful parts of Theorems 3.2.1 and 3.6.1 are the parts (ii), which 
give sufficient conditions for a critical function to maximize an integral subject 
to certain side conditions. These results can be derived very easily as follows by 
the method of undetermined multipliers. 

Lemma 3.6.1 Let $F_1, \ldots, F_{m+1}$ be real-valued functions defined over a space
$U$, and consider the problem of maximizing $F_{m+1}(u)$ subject to $F_i(u) = c_i$
($i = 1, \ldots, m$). A sufficient condition for a point $u^0$ satisfying the side conditions to
be a solution of the given problem is that among all points of $U$ it maximizes

$$F_{m+1}(u) - \sum_{i=1}^{m} k_i F_i(u)$$

for some $k_1, \ldots, k_m$.

When applying the lemma one usually carries out the maximization for
arbitrary $k$'s, and then determines the constants so as to satisfy the side
conditions.

Proof. If $u$ is any point satisfying the side conditions, then

$$F_{m+1}(u) - \sum_{i=1}^{m} k_i F_i(u) \le F_{m+1}(u^0) - \sum_{i=1}^{m} k_i F_i(u^0),$$

and hence $F_{m+1}(u) \le F_{m+1}(u^0)$.

As an application consider the problem treated in Theorem 3.6.1. Let $U$ be
the space of critical functions $\phi$, and let $F_i(\phi) = \int \phi f_i \, d\mu$. Then a sufficient
condition for $\phi$ to maximize $F_{m+1}(\phi)$, subject to $F_i(\phi) = c_i$, is that it maximizes
$F_{m+1}(\phi) - \sum k_i F_i(\phi) = \int (f_{m+1} - \sum k_i f_i)\phi \, d\mu$. This is achieved by setting $\phi(x) = 1$
or 0 as $f_{m+1}(x) >$ or $< \sum k_i f_i(x)$. ■
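The method of undetermined multipliers can be checked on a small finite example. The sketch below is illustrative only: the six-point sample space, the functions `f1`, `f2`, `f3`, and the multipliers `k1`, `k2` are hypothetical, integrals are sums with respect to counting measure, and the brute-force comparison covers nonrandomized critical functions only (the lemma itself covers all critical functions).

```python
from itertools import product

# Hypothetical integrands on the sample space {0, 1, ..., 5} (counting measure).
f1 = [1, 1, 1, 1, 1, 1]
f2 = [0, 1, 2, 3, 4, 5]
f3 = [0, 1, 4, 9, 16, 25]       # plays the role of f_{m+1} with m = 2
k1, k2 = -5, 5                  # multipliers fixed in advance

def integral(phi, f):
    """Sum of phi * f over the sample space."""
    return sum(p * v for p, v in zip(phi, f))

# Critical function of the form (3.28): reject where f3 > k1*f1 + k2*f2.
phi0 = [1 if f3[x] > k1 * f1[x] + k2 * f2[x] else 0 for x in range(6)]

# Side-condition values attained by phi0; these play the role of c_1 and c_2.
c1, c2 = integral(phi0, f1), integral(phi0, f2)

# Brute force over all nonrandomized tests meeting the same side conditions.
best = max(integral(phi, f3)
           for phi in product([0, 1], repeat=6)
           if integral(phi, f1) == c1 and integral(phi, f2) == c2)
```

Here the constants are determined after the maximization, as the remark following the lemma describes: the $k$'s are fixed first, and $c_1$, $c_2$ are read off from the resulting test. The value `best` coincides with `integral(phi0, f3)`.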


3.7 Two-Sided Hypotheses 

UMP tests exist not only for one-sided but also for certain two-sided hypotheses
of the form

$$H : \theta \le \theta_1 \ \text{or} \ \theta \ge \theta_2 \qquad (\theta_1 < \theta_2). \tag{3.30}$$

This problem arises when trying to demonstrate equivalence (sometimes called
bioequivalence) of treatments; for example, a new drug may be declared equivalent
to the current standard drug if the difference $\theta$ in therapeutic effect is small,
meaning that $\theta$ lies in a small interval about 0. Such testing problems also occur when
one wishes to determine whether given specifications have been met concerning
the proportion of an ingredient in a drug or some other compound, or whether a
measuring instrument, for example a scale, is properly balanced. One then sets
up the hypothesis that $\theta$ does not lie within the required limits, so that an error
of the first kind consists in declaring $\theta$ to be satisfactory when in fact it is not.
In practice, the decision to accept $H$ will typically be accompanied by a statement
of whether $\theta$ is believed to be $\le \theta_1$ or $\ge \theta_2$. The implications of $H$ are,
however, frequently sufficiently important so that acceptance will in any case be
followed by a more detailed investigation. If a manufacturer tests each precision
instrument before releasing it and the test indicates an instrument to be out of
balance, further work will be done to get it properly adjusted. If in a scientific
investigation the inequalities $\theta \le \theta_1$ and $\theta \ge \theta_2$ contradict some assumptions
that have been formulated, a more complex theory may be needed and further
experimentation will be required. In such situations there may be only two basic
choices, to act as if $\theta_1 < \theta < \theta_2$ or to carry out some further investigation, and
the formulation of the problem as that of testing the hypothesis $H$ may be
appropriate. In the present section, the existence of a UMP test of $H$ will be proved
for one-parameter exponential families.


Theorem 3.7.1 (i) For testing the hypothesis $H : \theta \le \theta_1$ or $\theta \ge \theta_2$ ($\theta_1 < \theta_2$)
against the alternatives $K : \theta_1 < \theta < \theta_2$ in the one-parameter exponential family
(3.19) there exists a UMP test given by

$$\phi(x) = \begin{cases} 1 & \text{when } C_1 < T(x) < C_2 \ (C_1 < C_2), \\ \gamma_i & \text{when } T(x) = C_i, \ i = 1, 2, \\ 0 & \text{when } T(x) < C_1 \text{ or } > C_2, \end{cases} \tag{3.31}$$

where the $C$'s and $\gamma$'s are determined by

$$E_{\theta_1}\phi(X) = E_{\theta_2}\phi(X) = \alpha. \tag{3.32}$$

(ii) This test minimizes $E_\theta\phi(X)$ subject to (3.32) for all $\theta < \theta_1$ and $> \theta_2$.

(iii) For $0 < \alpha < 1$ the power function of this test has a maximum at a
point $\theta_0$ between $\theta_1$ and $\theta_2$ and decreases strictly as $\theta$ tends away from $\theta_0$ in
either direction, unless there exist two values $t_1, t_2$ such that
$P_\theta\{T(X) = t_1\} + P_\theta\{T(X) = t_2\} = 1$ for all $\theta$.

Proof. (i): One can restrict attention to the sufficient statistic $T = T(X)$, the
distribution of which by Lemma 2.7.2 is

$$dP_\theta(t) = C(\theta) e^{Q(\theta)t} \, d\nu(t),$$

where $Q(\theta)$ is assumed to be strictly increasing. Let $\theta_1 < \theta' < \theta_2$, and consider
first the problem of maximizing $E_{\theta'}\psi(T)$ subject to (3.32) with $\phi(x) = \psi[T(x)]$.
If $M$ denotes the set of all points $(E_{\theta_1}\psi(T), E_{\theta_2}\psi(T))$ as $\psi$ ranges over the totality
of critical functions, then the point $(\alpha, \alpha)$ is an inner point of $M$. This follows
from the fact that by Corollary 3.2.1 the set $M$ contains points $(\alpha, u_1)$ and $(\alpha, u_2)$
with $u_1 < \alpha < u_2$ and that it contains all points $(u, u)$ with $0 < u < 1$. Hence
by part (iv) of Theorem 3.6.1 there exist constants $k_1, k_2$ and a test $\psi_0(t)$ such that
$\phi_0(x) = \psi_0[T(x)]$ satisfies (3.32) and that $\psi_0(t) = 1$ when

$$k_1 C(\theta_1) e^{Q(\theta_1)t} + k_2 C(\theta_2) e^{Q(\theta_2)t} < C(\theta') e^{Q(\theta')t}$$

and therefore when

$$a_1 e^{b_1 t} + a_2 e^{b_2 t} < 1 \qquad (b_1 < 0 < b_2),$$

and $\psi_0(t) = 0$ when the left-hand side is $> 1$. Here the $a$'s cannot both be $\le 0$,
since then the test would always reject. If one of the $a$'s is $\le 0$ and the other
one is $> 0$, then the left-hand side is strictly monotone, and the test is of the
one-sided type considered in Corollary 3.4.1, which has a strictly monotone power
function and hence cannot satisfy (3.32). Since therefore both $a$'s are positive,
the test satisfies (3.31). It follows from Lemma 3.7.1 below that the $C$'s and $\gamma$'s
are uniquely determined by (3.31) and (3.32), and hence from Theorem 3.6.1(iii)
that the test is UMP subject to the weaker restriction $E_{\theta_i}\psi(T) \le \alpha$ ($i = 1, 2$).
To complete the proof that this test is UMP for testing $H$, it is necessary to show
that it satisfies $E_\theta\psi(T) \le \alpha$ for $\theta \le \theta_1$ and $\theta \ge \theta_2$. This follows from (ii) by
comparison with the test $\psi(t) \equiv \alpha$.

(ii): Let $\theta' < \theta_1$, and apply Theorem 3.6.1(iv) to minimize $E_{\theta'}\phi(X)$ subject to
(3.32). Dividing through by $e^{Q(\theta_1)t}$, the desired test is seen to have a rejection
region of the form

$$a_1 e^{b_1 t} + a_2 e^{b_2 t} < 1 \qquad (b_1 < 0 < b_2).$$

Thus it coincides with the test $\psi_0(t)$ obtained in (i). By Theorem 3.6.1(iv) the
first and third conditions of (3.31) are also necessary, and the optimum test is
therefore unique provided $P\{T = C_i\} = 0$.

(iii): Without loss of generality let $Q(\theta) = \theta$. It follows from (i) and the
continuity of $\beta(\theta) = E_\theta\phi(X)$ that either $\beta(\theta)$ satisfies (iii) or there exist three points
$\theta' < \theta'' < \theta'''$ such that $\beta(\theta'') \le \beta(\theta') = \beta(\theta''') = c$, say. Then $0 < c < 1$,
since $\beta(\theta') = 0$ (or 1) implies $\phi(t) = 0$ (or 1) a.e. $\nu$ and this is excluded by
(3.32). As is seen by the proof of (i), the test maximizes $E_{\theta''}\phi(X)$ subject to
$E_{\theta'}\phi(X) = E_{\theta'''}\phi(X) = c$ for all $\theta' < \theta'' < \theta'''$. However, unless $T$ takes on at
most two values with probability 1 for all $\theta$, $p_{\theta'}, p_{\theta''}, p_{\theta'''}$ are linearly independent,
which by Corollary 3.6.1 implies $\beta(\theta'') > c$. ■

In order to determine the $C$'s and $\gamma$'s, one will in practice start with some trial
values $C_1^*, \gamma_1^*$, find $C_2^*, \gamma_2^*$ such that $\beta^*(\theta_1) = \alpha$, and compute $\beta^*(\theta_2)$, which will
usually be either too large or too small. For the selection of the next trial values
it is then helpful to note that if $\beta^*(\theta_2) < \alpha$, the correct acceptance region is to
the right of the one chosen, that is, it satisfies either $C_1 > C_1^*$ or $C_1 = C_1^*$ and
$\gamma_1 < \gamma_1^*$, and that the converse holds if $\beta^*(\theta_2) > \alpha$. This is a consequence of the
following lemma.

Lemma 3.7.1 Let $p_\theta(x)$ satisfy the assumptions of Lemma 3.4.2(iv).

(i) If $\phi$ and $\phi^*$ are two tests satisfying (3.31) and $E_{\theta_1}\phi(T) = E_{\theta_1}\phi^*(T)$, and
if $\phi^*$ is to the right of $\phi$, then $\beta(\theta) <$ or $> \beta^*(\theta)$ as $\theta > \theta_1$ or $< \theta_1$.

(ii) If $\phi$ and $\phi^*$ satisfy (3.31) and (3.32), then $\phi = \phi^*$ with probability one.

Proof. (i): The result follows from Lemma 3.4.2(iv) with $\psi = \phi^* - \phi$. (ii): Since
$E_{\theta_1}\phi(T) = E_{\theta_1}\phi^*(T)$, $\phi^*$ lies either to the left or the right of $\phi$, and application
of (i) completes the proof.

Although a UMP test exists for testing that $\theta \le \theta_1$ or $\ge \theta_2$ in an exponential
family, the same is not true for the dual hypothesis $H : \theta_1 \le \theta \le \theta_2$ or for testing
$\theta = \theta_0$ (Problem 3.54). There do, however, exist UMP unbiased tests of these
hypotheses, as will be shown in Chapter 4. ■
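The trial-and-error determination of $C_1$ and $C_2$ described before Lemma 3.7.1 can be automated. The following sketch makes simplifying assumptions that are not in the text: a single observation $T \sim N(\theta, 1)$ (a continuous exponential family, so the $\gamma$'s play no role), and bisection on the trial value of $C_1$, which relies on the monotonicity given by Lemma 3.7.1(i). All function names are hypothetical.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(theta, c1, c2):
    """Rejection probability P_theta{c1 < T < c2} for T ~ N(theta, 1)."""
    return Phi(c2 - theta) - Phi(c1 - theta)

def root(f, lo, hi, iters=200):
    """Bisection for a sign change from negative to positive: f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def c2_for(c1, theta1, alpha):
    """Given a trial c1, find c2 such that the power at theta1 equals alpha."""
    return root(lambda c2: power(theta1, c1, c2) - alpha, c1, theta1 + 40.0)

def two_sided_cutoffs(theta1, theta2, alpha):
    """Find (C1, C2) with power alpha at both theta1 and theta2, as in (3.32)."""
    # A trial c1 is admissible only if Phi(c1 - theta1) < 1 - alpha.
    c1_max = root(lambda c: Phi(c - theta1) - (1.0 - alpha),
                  theta1 - 40.0, theta1 + 40.0)
    # Moving the interval to the right raises the power at theta2 > theta1.
    gap = lambda c1: power(theta2, c1, c2_for(c1, theta1, alpha)) - alpha
    c1 = root(gap, theta1 - 10.0, c1_max - 1e-6)
    return c1, c2_for(c1, theta1, alpha)
```

For example, `two_sided_cutoffs(-1.0, 1.0, 0.05)` returns an interval symmetric about 0, as one would expect from the symmetry of the problem, and the power at $\theta = 0$ exceeds $\alpha$.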


3.8 Least Favorable Distributions 

It is a consequence of Theorem 3.2.1 that there always exists a most powerful
test for testing a simple hypothesis against a simple alternative. More generally,
consider the case of a Euclidean sample space; probability densities $f_\theta$, $\theta \in \omega$,
and $g$ with respect to a measure $\mu$; and the problem of testing $H : f_\theta$, $\theta \in \omega$,
against the simple alternative $K : g$. The existence of a most powerful level-$\alpha$ test
then follows from the weak compactness theorem for critical functions (Theorem
A.5.1 of the Appendix) as in Theorem 3.6.1(i).

Theorem 3.2.1 also provides an explicit construction for the most powerful test 
in the case of a simple hypothesis. We shall now extend this theorem to composite 
hypotheses in the direction of Theorem 3.6.1 by the method of undetermined 
multipliers. However, in the process of extension the result becomes much less 
explicit. Essentially it leaves open the determination of the multipliers, which 
now take the form of an arbitrary distribution. In specific problems this usually 
still involves considerable difficulty. 

From another point of view the method of attack, as throughout the theory of
hypothesis testing, is to reduce the composite hypothesis to a simple one. This
is achieved by considering weighted averages of the distributions of $H$. The
composite hypothesis $H$ is replaced by the simple hypothesis $H_\Lambda$ that the probability
density of $X$ is given by

$$h_\Lambda(x) = \int_\omega f_\theta(x) \, d\Lambda(\theta),$$

where $\Lambda$ is a probability distribution over $\omega$. The problem of finding a suitable
$\Lambda$ is frequently made easier by the following consideration. Since $H$ provides no
information concerning $\theta$ and since $H_\Lambda$ is to be equivalent to $H$ for the purpose
of testing against $g$, knowledge of the distribution $\Lambda$ should provide as little help
for this task as possible. To make this precise suppose that $\theta$ is known to have a
distribution $\Lambda$. Then the maximum power $\beta_\Lambda$ that can be attained against $g$ is
that of the most powerful test $\phi_\Lambda$ for testing $H_\Lambda$ against $g$. The distribution $\Lambda$ is
said to be least favorable (at level $\alpha$) if for all $\Lambda'$ the inequality $\beta_\Lambda \le \beta_{\Lambda'}$ holds.


Theorem 3.8.1 Let a $\sigma$-field be defined over $\omega$ such that the densities $f_\theta(x)$
are jointly measurable in $\theta$ and $x$. Suppose that over this $\sigma$-field there exists a
probability distribution $\Lambda$ such that the most powerful level-$\alpha$ test $\phi_\Lambda$ for testing
$H_\Lambda$ against $g$ is of size $\le \alpha$ also with respect to the original hypothesis $H$.

(i) The test $\phi_\Lambda$ is most powerful for testing $H$ against $g$.

(ii) If $\phi_\Lambda$ is the unique most powerful level-$\alpha$ test for testing $H_\Lambda$ against $g$, it is
also the unique most powerful test of $H$ against $g$.

(iii) The distribution $\Lambda$ is least favorable.


Proof. We note first that $h_\Lambda$ is again a density with respect to $\mu$, since by
Fubini's theorem (Theorem 2.2.4)

$$\int h_\Lambda(x) \, d\mu(x) = \int_\omega d\Lambda(\theta) \int f_\theta(x) \, d\mu(x) = \int_\omega d\Lambda(\theta) = 1.$$

Suppose that $\phi_\Lambda$ is a level-$\alpha$ test for testing $H$, and let $\phi^*$ be any other level-$\alpha$
test. Then since $E_\theta\phi^*(X) \le \alpha$ for all $\theta \in \omega$, we have

$$\int \phi^*(x) h_\Lambda(x) \, d\mu(x) = \int_\omega E_\theta\phi^*(X) \, d\Lambda(\theta) \le \alpha.$$

Therefore $\phi^*$ is a level-$\alpha$ test also for testing $H_\Lambda$ and its power cannot exceed
that of $\phi_\Lambda$. This proves (i) and (ii). If $\Lambda'$ is any distribution, it follows further
that $\phi_\Lambda$ is a level-$\alpha$ test also for testing $H_{\Lambda'}$, and hence that its power against $g$
cannot exceed that of the most powerful test, which by definition is $\beta_{\Lambda'}$. ■

The conditions of this theorem can be given a somewhat different form by
noting that $\phi_\Lambda$ can satisfy $\int_\omega E_\theta\phi_\Lambda(X) \, d\Lambda(\theta) = \alpha$ and $E_\theta\phi_\Lambda(X) \le \alpha$ for all
$\theta \in \omega$ only if the set of $\theta$'s with $E_\theta\phi_\Lambda(X) = \alpha$ has $\Lambda$-measure one.


Corollary 3.8.1 Suppose that $\Lambda$ is a probability distribution over $\omega$ and that $\omega'$
is a subset of $\omega$ with $\Lambda(\omega') = 1$. Let $\phi_\Lambda$ be a test such that

$$\phi_\Lambda(x) = \begin{cases} 1 & \text{if } g(x) > k \int f_\theta(x) \, d\Lambda(\theta), \\ 0 & \text{if } g(x) < k \int f_\theta(x) \, d\Lambda(\theta). \end{cases} \tag{3.33}$$

Then $\phi_\Lambda$ is a most powerful level-$\alpha$ test for testing $H$ against $g$ provided

$$E_{\theta'}\phi_\Lambda(X) = \sup_{\theta \in \omega} E_\theta\phi_\Lambda(X) = \alpha \quad \text{for } \theta' \in \omega'. \tag{3.34}$$

Theorems 3.4.1 and 3.7.1 constitute two simple applications of Theorem 3.8.1.
The set $\omega'$ over which the least favorable distribution $\Lambda$ is concentrated consists
of the single point $\theta_0$ in the first of these examples and of the two points $\theta_1$ and
$\theta_2$ in the second. This is what one might expect, since in both cases these are
the distributions of $H$ that appear to be “closest” to $K$. Another example in
which the least favorable distribution is concentrated at a single point is the
following.

Example 3.8.1 (Sign test) The quality of items produced by a manufacturing
process is measured by a characteristic $X$ such as the tensile strength of a piece
of material, or the length of life or brightness of a light bulb. For an item to
be satisfactory $X$ must exceed a given constant $u$, and one wishes to test the
hypothesis $H : p \ge p_0$, where

$$p = P\{X \le u\}$$

is the probability of an item being defective. Let $X_1, \ldots, X_n$ be the measurements
of $n$ sample items, so that the $X$'s are independently distributed with a common
distribution about which no knowledge is assumed. Any distribution on the real
line can be characterized by the probability $p$ together with the conditional
probability distributions $P_-$ and $P_+$ of $X$ given $X \le u$ and $X > u$ respectively. If the
distributions $P_-$ and $P_+$ have probability densities $p_-$ and $p_+$, for example with
respect to $\mu = P_- + P_+$, then the joint density of $X_1, \ldots, X_n$ at a sample point
$x_1, \ldots, x_n$ satisfying

$$x_{i_1}, \ldots, x_{i_m} \le u < x_{j_1}, \ldots, x_{j_{n-m}}$$

is

$$p^m (1 - p)^{n-m} p_-(x_{i_1}) \cdots p_-(x_{i_m}) \, p_+(x_{j_1}) \cdots p_+(x_{j_{n-m}}).$$

Consider now a fixed alternative to $H$, say $(p_1, P_-, P_+)$, with $p_1 < p_0$. One would
then expect the least favorable distribution $\Lambda$ over $H$ to assign probability 1
to the distribution $(p_0, P_-, P_+)$ since this appears to be closest to the selected
alternative. With this choice of $\Lambda$, the test (3.33) becomes

$$\phi_\Lambda(x) = 1 \text{ or } 0 \quad \text{as} \quad \left( \frac{p_1}{p_0} \right)^m \left( \frac{q_1}{q_0} \right)^{n-m} > \text{ or } < C,$$

and hence as $m <$ or $> C$. The test therefore rejects when the number $M$ of
defectives is sufficiently small, or more precisely, when $M < C$ and with probability
$\gamma$ when $M = C$, where

$$P\{M < C\} + \gamma P\{M = C\} = \alpha \quad \text{for } p = p_0. \tag{3.35}$$

The distribution of $M$ is the binomial distribution $b(p, n)$, and does not depend
on $P_+$ and $P_-$. As a consequence, the power function of the test depends only on
$p$ and is a decreasing function of $p$, so that under $H$ it takes on its maximum for
$p = p_0$. This proves $\Lambda$ to be least favorable and $\phi_\Lambda$ to be most powerful. Since
the test is independent of the particular alternative chosen, it is UMP.

Expressed in terms of the variables $Z_i = X_i - u$, the test statistic $M$ is the
number of variables $\le 0$, and the test is the so-called sign test (cf. Section 4.9).
It is an example of a nonparametric test, since it is derived without assuming a
given functional form for the distribution of the $X$'s such as the normal, uniform,
or Poisson, in which only certain parameters are unknown.

The above argument applies, with only the obvious modifications, to the case
that an item is satisfactory if $X$ lies within certain limits: $u < X < v$. This occurs,
for example, if $X$ is the length of a metal part or the proportion of an ingredient
in a chemical compound, for which certain tolerances have been specified. More
generally the argument applies also to the situation in which $X$ is vector-valued.
Suppose that an item is satisfactory only when $X$ lies in a certain set $S$, for
example, if all the dimensions of a metal part or the proportions of several ingredients
lie within specified limits. The probability of a defective is then

$$p = P\{X \in S^c\},$$

and $P_-$ and $P_+$ denote the conditional distributions of $X$ given $X \in S$ and
$X \in S^c$ respectively. As before, there exists a UMP test of $H : p \ge p_0$, and
it rejects $H$ when the number $M$ of defectives is sufficiently small, with the
boundary of the test being determined by (3.35). ■
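The constants $C$ and $\gamma$ of (3.35) are easily computed from the binomial distribution of $M$ under $p = p_0$. The following is a minimal sketch; the function names are hypothetical, and the randomized test is summarized by the probability with which it rejects rather than by drawing an auxiliary uniform variable.

```python
import math

def sign_test_cutoff(n, p0, alpha):
    """Return (C, gamma) with P{M < C} + gamma * P{M = C} = alpha for M ~ b(p0, n)."""
    below = 0.0                                   # running value of P{M < c}
    for c in range(n + 1):
        at = math.comb(n, c) * p0**c * (1.0 - p0)**(n - c)   # P{M = c}
        if below + at > alpha:
            return c, (alpha - below) / at
        below += at
    raise ValueError("alpha must be smaller than 1")

def rejection_probability(xs, u, p0, alpha):
    """Probability with which the sign test rejects H: p >= p0 given the data xs."""
    m = sum(1 for x in xs if x <= u)              # number M of defectives
    c, gamma = sign_test_cutoff(len(xs), p0, alpha)
    if m < c:
        return 1.0
    if m == c:
        return gamma
    return 0.0
```

For instance, with $n = 10$, $p_0 = 1/2$ and $\alpha = .05$ one has $P\{M \le 1\} = 11/1024$ and $P\{M = 2\} = 45/1024$, so that $C = 2$ and $\gamma = (.05 - 11/1024)/(45/1024)$.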


A distribution $\Lambda$ satisfying the conditions of Theorem 3.8.1 exists in most of
the usual statistical problems, and in particular under the following assumptions.
Let the sample space be Euclidean, let $\omega$ be a closed Borel set in $s$-dimensional
Euclidean space, and suppose that $f_\theta(x)$ is a continuous function of $\theta$ for almost
all $x$. Then given any $g$ there exists a distribution $\Lambda$ satisfying the conditions of
Theorem 3.8.1 provided

$$\lim_{n \to \infty} \int_S f_{\theta_n}(x) \, d\mu(x) = 0$$

for every bounded set $S$ in the sample space and for every sequence of vectors $\theta_n$
whose distance from the origin tends to infinity.

From this it follows, as did Corollaries 3.2.1 and 3.6.1 from Theorems 3.2.1 and 3.6.1,
that if the above conditions hold and if $0 < \alpha < 1$, there exists a test of power
$\beta > \alpha$ for testing $H : f_\theta$, $\theta \in \omega$, against $g$ unless $g = \int f_\theta \, d\Lambda(\theta)$ for some $\Lambda$. An
example of the latter possibility is obtained by letting $f_\theta$ and $g$ be the normal
densities $N(\theta, \sigma_0^2)$ and $N(0, \sigma_1^2)$ respectively with $\sigma_0^2 < \sigma_1^2$. (See the following
section.)

The above and related results concerning the existence and structure of least
favorable distributions are given in Lehmann (1952b) (with the requirement that
$\omega$ be closed mistakenly omitted), in Reinhardt (1961), and in Krafft and Witting
(1967), where the relation to linear programming is explored.


3.9 Applications to Normal Distributions 

3.9.1 Univariate Normal Models 

Because of their wide applicability, the problems of testing the mean £ and vari¬ 
ance cr 2 of a normal distribution are of particular importance. Here and in similar 
problems later, the parameter not being tested is assumed to be unknown, but 
will not be shown explicitly in a statement of the hypothesis. We shall write, for 
example, a < ao instead of the more complete statement a < ag, —oo < £ < oo. 
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The standard (likelihood-ratio) tests of the two hypotheses a < a o and £ < £o 
are given by the rejection regions 


and 


— x) 2 > C 


(3.36) 


Vn{x - Co) 

v /^ rE ^-*) 2 


(3.37) 


The corresponding tests for the hypotheses $\sigma \ge \sigma_0$ and
$\xi \ge \xi_0$ are obtained from the rejection regions (3.36) and (3.37) by
reversing the inequalities. As will be shown in later chapters, these four tests
are UMP both within the class of unbiased and within the class of invariant tests
(but see Section 11.3 for problems arising when the assumption of normality does
not hold exactly). However, at the usual significance levels only the first of
them is actually UMP.
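As a small illustration of the two statistics compared with the constant C in
(3.36) and (3.37), the following sketch computes them for a made-up sample. The
function names and the data are assumptions introduced only for this
illustration; choosing C requires chi-squared or Student t quantiles, which are
not computed here.

```python
import math

def variance_statistic(xs):
    """Left side of (3.36): sum of squared deviations from the sample mean."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

def student_t_statistic(xs, xi0):
    """Left side of (3.37): sqrt(n)(xbar - xi0) / sqrt(sum (x_i - xbar)^2 / (n - 1))."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = variance_statistic(xs) / (n - 1)
    return math.sqrt(n) * (xbar - xi0) / math.sqrt(s2)

# Hypothetical data, used only to exercise the two functions.
sample = [4.1, 5.3, 4.8, 5.9, 5.2]
u = variance_statistic(sample)
t = student_t_statistic(sample, xi0=5.0)
print(f"sum of squares = {u:.4f}, t statistic = {t:.4f}")
```

Each hypothesis is rejected when the corresponding statistic is at least C, with
C chosen to give the desired level.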


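Example 3.9.1 (One-sided tests of variance.) Let $X_1, \ldots, X_n$ be a sample
from $N(\xi, \sigma^2)$, and consider first the hypotheses
$H_1 : \sigma \ge \sigma_0$ and $H_2 : \sigma \le \sigma_0$, and a simple
alternative $K : \xi = \xi_1$, $\sigma = \sigma_1$. It seems reasonable to suppose
that the least favorable distribution $\Lambda$ in the $(\xi, \sigma)$-plane is
concentrated on the line $\sigma = \sigma_0$. Since $Y = \sum X_i / n = \bar{X}$
and $U = \sum (X_i - \bar{X})^2$ are sufficient statistics for the parameters
$(\xi, \sigma)$, attention can be restricted to these variables. Their joint
density under $H_\Lambda$ is

\[
C_0\, u^{(n-3)/2} \exp\left(-\frac{u}{2\sigma_0^2}\right)
\int \exp\left[-\frac{n}{2\sigma_0^2}(y - \xi)^2\right] d\Lambda(\xi),
\]

while under $K$ it is

\[
C_1\, u^{(n-3)/2} \exp\left(-\frac{u}{2\sigma_1^2}\right)
\exp\left[-\frac{n}{2\sigma_1^2}(y - \xi_1)^2\right].
\]

The choice of $\Lambda$ is seen to affect only the distribution of $Y$. A least
favorable $\Lambda$ should therefore have the property that the density of $Y$
under $H_\Lambda$,

\[
\int \frac{\sqrt{n}}{\sqrt{2\pi\sigma_0^2}}
\exp\left[-\frac{n}{2\sigma_0^2}(y - \xi)^2\right] d\Lambda(\xi),
\]

comes as close as possible to the alternative density,

\[
\frac{\sqrt{n}}{\sqrt{2\pi\sigma_1^2}}
\exp\left[-\frac{n}{2\sigma_1^2}(y - \xi_1)^2\right].
\]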

At this point one must distinguish between $H_1$ and $H_2$. In the first case
$\sigma_1 < \sigma_0$. By suitable choice of $\Lambda$ the mean of $Y$ can be made
equal to $\xi_1$, but the variance will if anything be increased over its initial
value $\sigma_0^2$. This suggests that the least favorable distribution assigns
probability 1 to the point $\xi = \xi_1$, since in this way the distribution of
$Y$ is normal both under $H$ and $K$ with the same mean in both cases and the
smallest possible difference between the variances. The situation is somewhat
different for $H_2$, for which $\sigma_0 < \sigma_1$. If the least favorable
distribution $\Lambda$ has a density, say $\Lambda'$, the density of $Y$ under
$H_\Lambda$ becomes

\[
\int_{-\infty}^{\infty} \frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma_0}
\exp\left[-\frac{n}{2\sigma_0^2}(y - \xi)^2\right] \Lambda'(\xi)\, d\xi.
\]

This is the probability density of the sum of two independent random variables,
one distributed as $N(0, \sigma_0^2/n)$ and the other with density
$\Lambda'(\xi)$. If $\Lambda$ is taken to be
$N(\xi_1, (\sigma_1^2 - \sigma_0^2)/n)$, the distribution of $Y$ under $H_\Lambda$
becomes $N(\xi_1, \sigma_1^2/n)$, the same as under $K$.
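The convolution claim can be checked numerically. The sketch below integrates
the N(ξ, σ0²/n) density of Y against a N(ξ1, (σ1² − σ0²)/n) density for ξ by the
trapezoidal rule and compares the result with the N(ξ1, σ1²/n) density. The
parameter values, grid size, and function names are assumptions chosen only for
illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def density_y_under_h_lambda(y, n, xi1, sigma0, sigma1, grid=4001, width=10.0):
    """Trapezoidal approximation of the integral of
    N(y; xi, sigma0^2/n) * Lambda'(xi) over xi, where Lambda' is the
    N(xi1, (sigma1^2 - sigma0^2)/n) density (requires sigma0 < sigma1)."""
    var_lambda = (sigma1 ** 2 - sigma0 ** 2) / n
    sd = math.sqrt(var_lambda)
    lo, hi = xi1 - width * sd, xi1 + width * sd
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for j in range(grid):
        xi = lo + j * h
        weight = 0.5 if j in (0, grid - 1) else 1.0
        total += weight * normal_pdf(y, xi, sigma0 ** 2 / n) * normal_pdf(xi, xi1, var_lambda)
    return total * h

# Hypothetical parameter values with sigma0 < sigma1, as required for H_2.
n, xi1, sigma0, sigma1 = 5, 1.0, 1.0, 2.0
ys = (-1.0, 0.5, 1.0, 2.5)
mixed = [density_y_under_h_lambda(y, n, xi1, sigma0, sigma1) for y in ys]
target = [normal_pdf(y, xi1, sigma1 ** 2 / n) for y in ys]
for y, a, b in zip(ys, mixed, target):
    print(f"y = {y:5.2f}   under H_Lambda: {a:.8f}   under K: {b:.8f}")
```

The two columns should agree up to the quadrature error, which is what makes
this Λ a natural candidate for the least favorable distribution.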

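We now apply Corollary 3.8.1 with the distributions $\Lambda$ suggested above. For
$H_1$ it is more convenient to work with the original variables than with $Y$ and
$U$. Substitution in (3.33) gives $\phi(x) = 1$ when

\[
\frac{(2\pi\sigma_1^2)^{-n/2} \exp\left[-\frac{1}{2\sigma_1^2}\sum (x_i - \xi_1)^2\right]}
     {(2\pi\sigma_0^2)^{-n/2} \exp\left[-\frac{1}{2\sigma_0^2}\sum (x_i - \xi_1)^2\right]} > C,
\]

that is, when

\[
\sum (x_i - \xi_1)^2 \le C. \tag{3.38}
\]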


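To justify the choice of $\Lambda$, one must show that

\[
P\left\{\sum (X_i - \xi_1)^2 \le C \;\middle|\; \xi, \sigma\right\}
\]

takes on its maximum over the half plane $\sigma \ge \sigma_0$ at the point
$\xi = \xi_1$, $\sigma = \sigma_0$. For any fixed $\sigma$, the above is the
probability of the sample point falling in a sphere of fixed radius, computed
under the assumption that the $X$'s are independently distributed as
$N(\xi, \sigma^2)$. This probability is maximized when the center of the sphere
coincides with that of the distribution, that is, when $\xi = \xi_1$. (This
follows for example from Problem 7.15.) The probability then becomes

\[
P\left\{\sum \left(\frac{X_i - \xi_1}{\sigma}\right)^2 \le \frac{C}{\sigma^2}
\;\middle|\; \xi_1, \sigma\right\}
= P\left\{\sum V_i^2 \le \frac{C}{\sigma^2}\right\},
\]

where $V_1, \ldots, V_n$ are independently distributed as $N(0, 1)$. This is a
decreasing function of $\sigma$ and therefore takes on its maximum when
$\sigma = \sigma_0$.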
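In the case of $H_2$, application of Corollary 3.8.1 to the sufficient statistics
$(Y, U)$ gives $\phi(y, u) = 1$ when

\[
\frac{C_1 u^{(n-3)/2} \exp\left(-\frac{u}{2\sigma_1^2}\right)
      \exp\left[-\frac{n}{2\sigma_1^2}(y - \xi_1)^2\right]}
     {C_0 u^{(n-3)/2} \exp\left(-\frac{u}{2\sigma_0^2}\right)
      \int \exp\left[-\frac{n}{2\sigma_0^2}(y - \xi)^2\right] \Lambda'(\xi)\, d\xi}
= C' \exp\left[-\frac{u}{2}\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_0^2}\right)\right]
\ge C,
\]

that is, when

\[
u = \sum (x_i - \bar{x})^2 \ge C. \tag{3.39}
\]

Since the distribution of $\sum (X_i - \bar{X})^2 / \sigma^2$ does not depend on
$\xi$ or $\sigma$, the probability
$P\{\sum (X_i - \bar{X})^2 \ge C \mid \xi, \sigma\}$ is independent of $\xi$ and
increases with $\sigma$, so that the conditions of Corollary 3.8.1 are satisfied.
The test (3.39), being independent of $\xi_1$ and $\sigma_1$, is UMP for testing
$\sigma \le \sigma_0$ against $\sigma > \sigma_0$. It is also seen to coincide
with the likelihood-ratio test (3.36). On the other hand, the most powerful test
(3.38) for testing $\sigma \ge \sigma_0$ against $\sigma < \sigma_0$ does depend
on the value $\xi_1$ of $\xi$ under the alternative.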
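The monotonicity used for (3.39) can be made concrete. Since
∑(X_i − X̄)²/σ² has a chi-squared distribution with n − 1 degrees of freedom,
the rejection probability is P{χ²_{n−1} ≥ C/σ²}. The sketch below assumes n − 1
is even so that the chi-squared tail has an elementary closed form; the helper
names and the numerical values are assumptions made for this illustration.

```python
import math

def chi2_sf_even(x, df):
    """P{chi-squared with df degrees of freedom >= x}, for even df only,
    using exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def critical_value(alpha, df, sigma0):
    """C with P{sigma0^2 * chi2_df >= C} = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf_even(hi, df) > alpha:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf_even(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return sigma0 ** 2 * 0.5 * (lo + hi)

def power(sigma, C, df):
    """Rejection probability of the test (3.39) when the true s.d. is sigma."""
    return chi2_sf_even(C / sigma ** 2, df)

# Hypothetical setting: n = 5 observations, so df = n - 1 = 4 is even.
n, sigma0, alpha = 5, 1.0, 0.05
C = critical_value(alpha, n - 1, sigma0)
sigmas = (0.5, 0.8, 1.0, 1.5, 2.0)
powers = [power(s, C, n - 1) for s in sigmas]
for s, p in zip(sigmas, powers):
    print(f"sigma = {s:3.1f}   rejection probability = {p:.4f}")
```

The rejection probability does not involve ξ at all and increases with σ, so the
level is attained on the boundary σ = σ0.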
It has been tacitly assumed so far that $n > 1$. If $n = 1$, the argument applies
without change with respect to $H_1$, leading to (3.38) with $n = 1$. However, in
the discussion of $H_2$ the statistic $U$ now drops out, and $Y$ coincides with
the single observation $X$. Using the same $\Lambda$ as before, one sees that $X$
has the same distribution under $H_\Lambda$ as under $K$, and the test
$\phi_\Lambda$ therefore becomes $\phi_\Lambda(x) \equiv \alpha$. This satisfies
the conditions of Corollary 3.8.1 and is therefore the most powerful test for the
given problem. It follows that a single observation is of no value for testing
the hypothesis $H_2$, as seems intuitively obvious, but that it could be used to
test $H_1$ if the class of alternatives were sufficiently restricted. ■

The corresponding derivation for the hypothesis $\xi \le \xi_0$ is less
straightforward. It turns out 10 that Student's test given by (3.37) is most
powerful if the level of significance $\alpha$ is $\ge \frac{1}{2}$, regardless of
the alternative $\xi_1 > \xi_0$, $\sigma_1$. This test is therefore UMP for
$\alpha \ge \frac{1}{2}$. On the other hand, when $\alpha < \frac{1}{2}$ the most
powerful test of $H$ rejects when $\sum (x_i - a)^2 \le b$, where the constants
$a$ and $b$ depend on the alternative $(\xi_1, \sigma_1)$ and on $\alpha$. Thus
for the significance levels that are of interest, a UMP test of $H$ does not
exist. No new problem arises for the hypothesis $\xi \ge \xi_0$, since this
reduces to the case just considered through the transformation
$Y_i = \xi_0 - (X_i - \xi_0)$.

3.9.2 Multivariate Normal Models 

Let $X$ denote a $k \times 1$ random vector whose $i$th component, $X_i$, is a
real-valued random variable. The mean of $X$, denoted $E(X)$, is a vector with
$i$th component $E(X_i)$ (assuming it exists). The covariance matrix of $X$,
denoted $\Sigma$, is the $k \times k$ matrix with $(i, j)$ entry
$\mathrm{Cov}(X_i, X_j)$. $\Sigma$ is well-defined iff $E(|X|^2) < \infty$, where
$|\cdot|$ denotes the Euclidean norm. Note that, if $A$ is an $m \times k$ matrix,
then the $m \times 1$ vector $Y = AX$ has mean (vector) $AE(X)$ and covariance
matrix $A \Sigma A^T$, where $A^T$ is the transpose of $A$ (Problem 3.63).
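The rule for the moments of Y = AX is easy to check on a small case. The sketch
below uses plain nested lists; the helper names and the particular mean vector,
covariance matrix, and matrix A are assumptions invented for the illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def linear_map_moments(A, mean, Sigma):
    """Mean vector A E(X) and covariance matrix A Sigma A^T of Y = A X."""
    mean_y = [sum(a * m for a, m in zip(row, mean)) for row in A]
    cov_y = matmul(matmul(A, Sigma), transpose(A))
    return mean_y, cov_y

# Hypothetical 3-dimensional X and a 2 x 3 matrix A.
mean_x = [1.0, 2.0, 0.0]
Sigma = [[2.0, 0.5, 0.0],
         [0.5, 1.0, 0.3],
         [0.0, 0.3, 1.5]]
A = [[1.0, 1.0, 0.0],    # Y_1 = X_1 + X_2
     [0.0, 1.0, -1.0]]   # Y_2 = X_2 - X_3
mean_y, cov_y = linear_map_moments(A, mean_x, Sigma)
print(mean_y)
print(cov_y)
```

The diagonal entries can be confirmed by hand: Var(X_1 + X_2) = 2 + 1 + 2(0.5) = 4
and Var(X_2 − X_3) = 1 + 1.5 − 2(0.3) = 1.9.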

The multivariate generalization of a real-valued normally distributed random
variable is a random vector $X = (X_1, \ldots, X_k)^T$ with the multivariate
normal probability density

\[
\frac{\sqrt{|A|}}{(2\pi)^{k/2}}
\exp\left[-\frac{1}{2}\sum_i \sum_j a_{ij}(x_i - \xi_i)(x_j - \xi_j)\right], \tag{3.40}
\]

where the matrix $A = (a_{ij})$ is positive definite, and $|A|$ denotes its
determinant. The means and covariance matrix of the $X$'s are given by

\[
E(X_i) = \xi_i, \qquad E(X_i - \xi_i)(X_j - \xi_j) = \sigma_{ij}, \qquad
(\sigma_{ij}) = A^{-1}. \tag{3.41}
\]

The column vector $\xi = (\xi_1, \ldots, \xi_k)^T$ is the mean vector and
$\Sigma = A^{-1}$ is the covariance matrix of $X$.

Such a definition only applies when $A$ is nonsingular, in which case we say
that $X$ has a nonsingular multivariate normal distribution. More generally, we
say that $Y$ has a multivariate normal distribution if $Y = BX + \mu$ for some
$m \times k$ matrix of constants $B$ and $m \times 1$ constant vector $\mu$, where
$X$ has some nonsingular multivariate normal distribution. Then, $Y$ is
multivariate normal if and only if $\sum_{i=1}^m c_i Y_i$ is univariate normal
for every choice of constants $c_1, \ldots, c_m$, if we interpret
$N(\xi, \sigma^2)$ with $\sigma = 0$ to be the distribution that is point mass at
$\xi$. Basic properties of the multivariate normal distribution are given in
Anderson (2003).


10 See Lehmann and Stein (1948).

Example 3.9.2 (One-sided tests of a combination of means.) Assume $X$ is
multivariate normal with unknown mean $\xi = (\xi_1, \ldots, \xi_k)^T$ and known
covariance matrix $\Sigma$. Assume $a = (a_1, \ldots, a_k)^T$ is a fixed vector
with $a^T \Sigma a > 0$. The problem is to test

\[
H : \sum_{i=1}^k a_i \xi_i \le \delta \quad \text{vs.} \quad
K : \sum_{i=1}^k a_i \xi_i > \delta.
\]

We will show that a UMP level $\alpha$ test exists, which rejects when
$\sum a_i X_i > \delta + \sigma z_{1-\alpha}$, where $\sigma^2 = a^T \Sigma a$. To
see why, 11 we will consider four cases of increasing generality.

Case 1. If $k = 1$ and the problem is to test the mean of $X_1$, the result
follows by Problem 3.1.

Case 2. Consider now general $k$, so that $(X_1, \ldots, X_k)$ has mean
$(\xi_1, \ldots, \xi_k)$ and covariance matrix $\Sigma$. However, consider the
special case $(a_1, \ldots, a_k) = (1, 0, \ldots, 0)$. Also, assume $X_1$ and
$(X_2, \ldots, X_k)$ are independent. Then, for any fixed alternative
$(\xi_1', \ldots, \xi_k')$ with $\xi_1' > \delta$, the least favorable
distribution concentrates on the single point $(\delta, \xi_2', \ldots, \xi_k')$
(Problem 3.65).

Case 3. As in case 2, consider $a_1 = 1$ and $a_i = 0$ if $i > 1$, but now allow
$\Sigma$ to be an arbitrary covariance matrix. We can reduce the problem to case 2
by an appropriate linear transformation. Simply let $Y_1 = X_1$ and, for $i > 1$,
let

\[
Y_i = X_i - \frac{\mathrm{Cov}(X_1, X_i)}{\mathrm{Var}(X_1)}\, X_1.
\]

Then, it is easily checked that $\mathrm{Cov}(Y_1, Y_i) = 0$ if $i > 1$. Moreover,
$Y$ is just a 1:1 transformation of $X$. But, the problem of testing
$E(Y_1) = E(X_1)$ based on $Y = (Y_1, \ldots, Y_k)$ is in the form already studied
in case 2, and the UMP test rejects for large values of $Y_1 = X_1$.

Case 4. Now, consider arbitrary $(a_1, \ldots, a_k)$ satisfying
$a^T \Sigma a > 0$. Let $Z = OX$, where $O$ is any orthogonal matrix with first row
$(a_1, \ldots, a_k)$. Then, $E(Z_1) = \sum_{i=1}^k a_i \xi_i$, and the problem of
testing $E(Z_1) \le \delta$ versus $E(Z_1) > \delta$ reduces to case 3. Hence, the
UMP test rejects for large values of $Z_1 = \sum_{i=1}^k a_i X_i$. ■
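A minimal sketch of the resulting test, under the assumptions of Example 3.9.2
(known Σ, a with aᵀΣa > 0), is given below. The function names and the
numerical inputs are hypothetical; the standard normal quantile comes from
Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def scale(a, Sigma):
    """sigma = sqrt(a^T Sigma a); raises if the quadratic form is not positive."""
    k = len(a)
    q = sum(a[i] * Sigma[i][j] * a[j] for i in range(k) for j in range(k))
    if q <= 0:
        raise ValueError("a^T Sigma a must be positive")
    return math.sqrt(q)

def one_sided_test(x, a, Sigma, delta, alpha):
    """Return (statistic, cutoff, reject) for H: sum a_i xi_i <= delta."""
    stat = sum(ai * xi for ai, xi in zip(a, x))
    cutoff = delta + scale(a, Sigma) * STD_NORMAL.inv_cdf(1.0 - alpha)
    return stat, cutoff, stat > cutoff

def power_at(mean, a, Sigma, delta, alpha):
    """Rejection probability when the true mean vector is `mean`."""
    sigma = scale(a, Sigma)
    shift = (sum(ai * mi for ai, mi in zip(a, mean)) - delta) / sigma
    return STD_NORMAL.cdf(shift - STD_NORMAL.inv_cdf(1.0 - alpha))

# Hypothetical inputs: test xi_1 - xi_2 <= 0 at level 0.05.
a = [1.0, -1.0]
Sigma = [[1.0, 0.5], [0.5, 2.0]]
stat, cutoff, reject = one_sided_test([3.0, 0.2], a, Sigma, delta=0.0, alpha=0.05)
boundary_power = power_at([0.5, 0.5], a, Sigma, delta=0.0, alpha=0.05)
print(stat, cutoff, reject, boundary_power)
```

At a mean vector on the boundary of H the rejection probability equals α, as it
should for a level-α test whose power increases in ∑ a_i ξ_i.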

Example 3.9.3 (Equivalence tests of a combination of means.) As in Example
3.9.2, assume $X$ is multivariate normal $N(\xi, \Sigma)$ with unknown mean vector
$\xi$ and known covariance matrix $\Sigma$. Fix $\delta > 0$ and any vector
$a = (a_1, \ldots, a_k)^T$ satisfying $a^T \Sigma a > 0$. Consider testing

\[
H : \Big|\sum_{i=1}^k a_i \xi_i\Big| \ge \delta \quad \text{vs.} \quad
K : \Big|\sum_{i=1}^k a_i \xi_i\Big| < \delta.
\]


11 Proposition 15.2 of van der Vaart (1998) provides an alternative proof in the case 
E is invertible. 
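Then, a UMP level $\alpha$ test also exists and it rejects $H$ if

\[
\Big|\sum_{i=1}^k a_i X_i\Big| \le C,
\]

where $C = C(\alpha, \delta, \sigma)$ satisfies

\[
\Phi\left(\frac{C - \delta}{\sigma}\right) - \Phi\left(\frac{-C - \delta}{\sigma}\right) = \alpha \tag{3.42}
\]

and $\sigma^2 = a^T \Sigma a$. Hence, the power of this test against an alternative
$(\xi_1, \ldots, \xi_k)$ with $|\sum_i a_i \xi_i| = \delta' < \delta$ is

\[
\Phi\left(\frac{C - \delta'}{\sigma}\right) - \Phi\left(\frac{-C - \delta'}{\sigma}\right).
\]

To see why, we again consider four cases of increasing generality.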

Case 1. Suppose $k = 1$, so that $X_1 = X$ is $N(\xi, \sigma^2)$ and we are testing
$|\xi| \ge \delta$ versus $|\xi| < \delta$. (This case follows by Theorem 3.7.1,
but we argue independently so that the argument applies to the other cases as
well.) Fix an alternative $\xi = m$ with $|m| < \delta$. Reduce the composite null
hypothesis to a simple one via a least favorable distribution that places mass
$p$ on $N(\delta, \sigma^2)$ and mass $1 - p$ on $N(-\delta, \sigma^2)$. The value
of $p$ will be chosen shortly so that such a distribution is least favorable (and
will be seen to depend on $m$, $\alpha$, $\sigma$ and $\delta$). By the
Neyman-Pearson Lemma, the MP test of

\[
pN(\delta, \sigma^2) + (1 - p)N(-\delta, \sigma^2) \quad \text{vs.} \quad N(m, \sigma^2)
\]

rejects for small values of

\[
\frac{p \exp\left[-\frac{1}{2\sigma^2}(X - \delta)^2\right]
      + (1 - p) \exp\left[-\frac{1}{2\sigma^2}(X + \delta)^2\right]}
     {\exp\left[-\frac{1}{2\sigma^2}(X - m)^2\right]}, \tag{3.43}
\]

or equivalently for small values of $f(X)$, where

\[
f(x) = p \exp[(\delta - m)x/\sigma^2] + (1 - p) \exp[-(\delta + m)x/\sigma^2].
\]

We can now choose $p$ so that $f(C) = f(-C)$, so that $p$ must satisfy

\[
\frac{p}{1 - p} =
\frac{\exp[(\delta + m)C/\sigma^2] - \exp[-(\delta + m)C/\sigma^2]}
     {\exp[(\delta - m)C/\sigma^2] - \exp[-(\delta - m)C/\sigma^2]}. \tag{3.44}
\]

Since $\delta - m > 0$ and $\delta + m > 0$, both the numerator and denominator of
the right side of (3.44) are positive, so the right side is a positive number;
but $p/(1 - p)$ is a nondecreasing function of $p$ with range $[0, \infty)$ as $p$
varies from 0 to 1. Thus, $p$ is well-defined. Also, observe $f''(x) \ge 0$ for
all $x$. It follows that (for this special choice of $C$)

\[
\{X : f(X) \le f(C)\} = \{X : |X| \le C\}
\]

is the rejection region of the MP test. Such a test is easily seen to be level
$\alpha$ for the original composite null hypothesis because its power function is
symmetric and decreases away from zero. Thus, the result follows by Theorem 3.8.1.
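The choice of p in (3.44) can be illustrated numerically. Because each
difference of exponentials equals twice a hyperbolic sine, the right side of
(3.44) is sinh((δ + m)C/σ²)/sinh((δ − m)C/σ²). The sketch below solves for p and
checks that f(C) = f(−C); the values of C, δ, m, and σ and the function names
are assumptions chosen only for illustration (in the example itself C is fixed
by (3.42)).

```python
import math

def mixing_weight(C, delta, m, sigma):
    """Solve (3.44) for p, given |m| < delta and C > 0."""
    s2 = sigma ** 2
    ratio = math.sinh((delta + m) * C / s2) / math.sinh((delta - m) * C / s2)
    return ratio / (1.0 + ratio)

def f(x, p, delta, m, sigma):
    """The function f whose small values define the MP rejection region."""
    s2 = sigma ** 2
    return p * math.exp((delta - m) * x / s2) + (1.0 - p) * math.exp(-(delta + m) * x / s2)

# Hypothetical values with |m| < delta.
C, delta, m, sigma = 0.5, 1.0, 0.3, 1.0
p = mixing_weight(C, delta, m, sigma)
print(f"p = {p:.4f}")
print(f"f(C) = {f(C, p, delta, m, sigma):.6f}, f(-C) = {f(-C, p, delta, m, sigma):.6f}")
print(f"f(0) = {f(0.0, p, delta, m, sigma):.6f}, f(2C) = {f(2 * C, p, delta, m, sigma):.6f}")
```

Since f is convex and takes equal values at ±C, it lies below f(C) inside
[−C, C] and above it outside, which is the set identity used in Case 1.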

Case 2. Consider now general $k$, so that $(X_1, \ldots, X_k)$ has mean
$(\xi_1, \ldots, \xi_k)$ and covariance matrix $\Sigma$. However, consider the
special case $(a_1, \ldots, a_k) = (1, 0, \ldots, 0)$, so we are testing
$|\xi_1| \ge \delta$ versus $|\xi_1| < \delta$. Also, assume $X_1$ and
$(X_2, \ldots, X_k)$ are independent, so that the first row and first column of
$\Sigma$ are zero except the first entry, which is $\sigma^2$ (assumed positive).
Using the same reasoning as case 1, fix an alternative $m = (m_1, \ldots, m_k)$
with $|m_1| < \delta$ and consider testing

\[
pN((\delta, m_2, \ldots, m_k), \Sigma) + (1 - p)N((-\delta, m_2, \ldots, m_k), \Sigma)
\]

versus $N((m_1, \ldots, m_k), \Sigma)$. The likelihood ratio is in fact the same as
(3.43) because each term is now multiplied by the density of
$(X_2, \ldots, X_k)$ (by independence), and these densities cancel. The UMP test
from Case 1, which rejects when $|X_1| \le C$, is UMP in this situation as well.

Case 3. As in Case 2, consider $a_1 = 1$ and $a_i = 0$ if $i > 1$, but now allow
$\Sigma$ to be an arbitrary covariance matrix. By transforming $X$ to $Y$ as in
Case 3 of Example 3.9.2, the result follows (Problem 3.66).

Case 4. Now, consider arbitrary $(a_1, \ldots, a_k)$ satisfying
$a^T \Sigma a > 0$. As in Case 4 of Example 3.9.2, transform $X$ to $Z$ and the
result follows (Problem 3.66).
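The constant C of (3.42) has no closed form, but it is easy to find numerically.
The sketch below solves (3.42) by bisection and evaluates the rejection
probability of the test "reject when |∑ a_i X_i| ≤ C" at a few values of
∑ a_i ξ_i. The function names and the values of α, δ, and σ are assumptions made
for this illustration.

```python
from statistics import NormalDist

PHI = NormalDist().cdf

def rejection_prob(C, shift, sigma):
    """P{|Z| <= C} when Z is N(shift, sigma^2)."""
    return PHI((C - shift) / sigma) - PHI((-C - shift) / sigma)

def equivalence_cutoff(alpha, delta, sigma):
    """Solve (3.42) for C by bisection; the left side increases in C from 0."""
    lo, hi = 0.0, delta + 10.0 * sigma
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rejection_prob(mid, delta, sigma) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical values: level 0.05, equivalence margin delta = 1, sigma = 0.5.
alpha, delta, sigma = 0.05, 1.0, 0.5
C = equivalence_cutoff(alpha, delta, sigma)
power_center = rejection_prob(C, 0.0, sigma)    # alternative with delta' = 0
size_boundary = rejection_prob(C, delta, sigma) # boundary of H
size_outside = rejection_prob(C, 1.5, sigma)    # a point well inside H
print(C, power_center, size_boundary, size_outside)
```

The rejection probability equals α on the boundary |∑ a_i ξ_i| = δ, is smaller
farther out in the hypothesis, and is largest at the center of the alternative.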


3.10 Problems 

Section 3.2 

Problem 3.1 Let $X_1, \ldots, X_n$ be a sample from the normal distribution
$N(\xi, \sigma^2)$.

(i) If $\sigma = \sigma_0$ (known), there exists a UMP test for testing
$H : \xi \le \xi_0$ against $\xi > \xi_0$, which rejects when
$\sum (X_i - \xi_0)$ is too large.

(ii) If $\xi = \xi_0$ (known), there exists a UMP test for testing
$H : \sigma \le \sigma_0$ against $K : \sigma > \sigma_0$, which rejects when
$\sum (X_i - \xi_0)^2$ is too large.

Problem 3.2 UMP test for $U(0, \theta)$. Let $X = (X_1, \ldots, X_n)$ be a sample
from the uniform distribution on $(0, \theta)$.

(i) For testing $H : \theta \le \theta_0$ against $K : \theta > \theta_0$ any test
is UMP at level $\alpha$ for which $E_{\theta_0}\phi(X) = \alpha$,
$E_\theta \phi(X) \le \alpha$ for $\theta \le \theta_0$, and $\phi(x) = 1$ when
$\max(x_1, \ldots, x_n) > \theta_0$.

(ii) For testing $H : \theta = \theta_0$ against $K : \theta \ne \theta_0$ a unique
UMP test exists, and is given by $\phi(x) = 1$ when
$\max(x_1, \ldots, x_n) > \theta_0$ or
$\max(x_1, \ldots, x_n) \le \theta_0 \sqrt[n]{\alpha}$, and $\phi(x) = 0$
otherwise.

[(i): For each $\theta > \theta_0$ determine the ordering established by
$r(x) = p_\theta(x)/p_{\theta_0}(x)$ and use the fact that many points are
equivalent under this ordering.

(ii): Determine the UMP tests for testing $\theta = \theta_0$ against
$\theta < \theta_0$ and combine this result with that of part (i).]

Problem 3.3 Suppose $N$ i.i.d. random variables are generated from the same
known strictly increasing absolutely continuous cdf $F(\cdot)$. We are told only
$X$, the maximum of these random variables. Is there a UMP size $\alpha$ test of

\[
H_0 : N \le 5 \quad \text{versus} \quad H_1 : N > 5?
\]

If so, find it.


Problem 3.4 UMP test for exponential densities. Let $X_1, \ldots, X_n$ be a
sample from the exponential distribution $E(a, b)$ of Problem 1.18, and let
$X_{(1)} = \min(X_1, \ldots, X_n)$.

(i) Determine the UMP test for testing $H : a = a_0$ against $K : a \ne a_0$ when
$b$ is assumed known.

(ii) The power of any MP level-$\alpha$ test of $H : a = a_0$ against
$K : a = a_1 < a_0$ is given by

\[
\beta^*(a_1) = 1 - (1 - \alpha)e^{-n(a_0 - a_1)/b}.
\]

(iii) For the problem of part (i), when $b$ is unknown, the power of any level
$\alpha$ test which rejects when

\[
\frac{X_{(1)} - a_0}{\sum [X_i - X_{(1)}]} \le C_1 \quad \text{or} \quad \ge C_2
\]

against any alternative $(a_1, b)$ with $a_1 < a_0$ is equal to $\beta^*(a_1)$ of
part (ii) (independent of the particular choice of $C_1$ and $C_2$).

(iv) The test of part (iii) is a UMP level-$\alpha$ test of $H : a = a_0$ against
$K : a \ne a_0$ ($b$ unknown).

(v) Determine the UMP test for testing $H : a = a_0$, $b = b_0$ against the
alternatives $a < a_0$, $b < b_0$.

(vi) Explain the (very unusual) existence in this case of a UMP test in the
presence of a nuisance parameter [part (iv)] and for a hypothesis specifying
two parameters [part (v)].

[(i) The variables $Y_i = e^{-X_i/b}$ are a sample from the uniform distribution
on $(0, e^{-a/b})$.]

Note. For more general versions of parts (ii)-(iv) see Takeuchi (1969) and Kabe
and Laurent (1981).


Problem 3.5 In the proof of Theorem 3.2.1(i), consider the set of $c$ satisfying
$\alpha(c) \le \alpha \le \alpha(c - 0)$. If there is only one such $c$, $c$ is
unique; otherwise, there is an interval of such values $[c_1, c_2]$. Argue that,
in this case, if $\alpha(c)$ is continuous at $c_2$, then $P_i(C) = 0$ for
$i = 0, 1$, where

\[
C = \left\{x : p_0(x) > 0 \text{ and } c_1 < \frac{p_1(x)}{p_0(x)} \le c_2\right\}.
\]

If $\alpha(c)$ is not continuous at $c_2$, then the result is false.



Problem 3.6 Let $P_0, P_1, P_2$ be the probability distributions assigning to the
integers 1, ..., 6 the following probabilities:

          1     2     3     4     5     6
  P_0   .03   .02   .02   .01    0    .92
  P_1   .06   .05   .08   .02   .01   .78
  P_2   .09   .05   .12    0    .02   .72

Determine whether there exists a level-$\alpha$ test of $H : P = P_0$ which is UMP
against the alternatives $P_1$ and $P_2$ when (i) $\alpha = .01$; (ii)
$\alpha = .05$; (iii) $\alpha = .07$.


Problem 3.7 Let the distribution of $X$ be given by

  x                   0          1            2               3
  P_theta(X = x)    theta     2 theta    .9 - 2 theta    .1 - theta

where $0 < \theta < .1$. For testing $H : \theta = .05$ against $\theta > .05$ at
level $\alpha = .05$, determine which of the following tests (if any) is UMP:

(i) $\phi(0) = 1$, $\phi(1) = \phi(2) = \phi(3) = 0$;

(ii) $\phi(1) = .5$, $\phi(0) = \phi(2) = \phi(3) = 0$;

(iii) $\phi(3) = 1$, $\phi(0) = \phi(1) = \phi(2) = 0$.

Problem 3.8 A random variable $X$ has the Pareto distribution $P(c, \tau)$ if its
density is $c\tau^c / x^{c+1}$, $0 < \tau < x$, $0 < c$.

(i) Show that this defines a probability density.

(ii) If $X$ has distribution $P(c, \tau)$, then $Y = \log X$ has exponential
distribution $E(\xi, b)$ with $\xi = \log \tau$, $b = 1/c$.

(iii) If $X_1, \ldots, X_n$ is a sample from $P(c, \tau)$, use (ii) and Problem 3.4
to obtain UMP tests of (a) $H : \tau = \tau_0$ against $\tau \ne \tau_0$ when $b$
is known; (b) $H : c = c_0$, $\tau = \tau_0$ against $c > c_0$, $\tau < \tau_0$.

Problem 3.9 Let $X$ be distributed according to $P_\theta$, $\theta \in \Omega$,
and let $T$ be sufficient for $\theta$. If $\phi(X)$ is any test of a hypothesis
concerning $\theta$, then $\psi(T)$ given by $\psi(t) = E[\phi(X) \mid t]$ is a
test depending on $T$ only, and its power function is identical with that of
$\phi(X)$.


Problem 3.10 In the notation of Section 3.2, consider the problem of testing
$H_0 : P = P_0$ against $H_1 : P = P_1$, and suppose that known probabilities
$\pi_0 = \pi$ and $\pi_1 = 1 - \pi$ can be assigned to $H_0$ and $H_1$ prior to
the experiment.

(i) The overall probability of an error resulting from the use of a test $\phi$
is

\[
\pi E_0 \phi(X) + (1 - \pi) E_1[1 - \phi(X)].
\]

(ii) The Bayes test minimizing this probability is given by (3.8) with
$k = \pi_0/\pi_1$.

(iii) The conditional probability of $H_i$ given $X = x$, the posterior
probability of $H_i$, is

\[
\frac{\pi_i p_i(x)}{\pi_0 p_0(x) + \pi_1 p_1(x)},
\]

and the Bayes test therefore decides in favor of the hypothesis with the
larger posterior probability.
Problem 3.11 (i) For testing $H_0 : \theta = 0$ against $H_1 : \theta = \theta_1$
when $X$ is $N(\theta, 1)$, given any $0 < \alpha < 1$ and any $0 < \pi < 1$ (in
the notation of the preceding problem), there exists $\theta_1$ and $x$ such that
(a) $H_0$ is rejected when $X = x$ but (b) $P(H_0 \mid x)$ is arbitrarily close
to 1.

(ii) The paradox of part (i) is due to the fact that $\alpha$ is held constant
while the power against $\theta_1$ is permitted to get arbitrarily close to 1. The
paradox disappears if $\alpha$ is determined so that the probabilities of type I
and type II error are equal [but see Berger and Sellke (1987)].

[For a discussion of such paradoxes, see Lindley (1957), Bartlett (1957), Schafer
(1982, 1988) and Robert (1993).]

Problem 3.12 Let $X_1, \ldots, X_n$ be independently distributed, each uniformly
over the integers $1, 2, \ldots, \theta$. Determine whether there exists a UMP
test for testing $H : \theta = \theta_0$, at level $1/\theta_0^n$ against the
alternatives (i) $\theta > \theta_0$; (ii) $\theta < \theta_0$; (iii)
$\theta \ne \theta_0$.

Problem 3.13 The following example shows that the power of a test can sometimes
be increased by selecting a random rather than a fixed sample size even
when the randomization does not depend on the observations. Let
$X_1, \ldots, X_n$ be independently distributed as $N(\theta, 1)$, and consider
the problem of testing $H : \theta = 0$ against $K : \theta = \theta_1 > 0$.

(i) The power of the most powerful test as a function of the sample size $n$ is
not necessarily concave.

(ii) In particular for $\alpha = .005$, $\theta_1 = \frac{1}{2}$, better power is
obtained by taking 2 or 16 observations with probability $\frac{1}{2}$ each than
by taking a fixed sample of 9 observations.

(iii) The power can be increased further if the test is permitted to have
different significance levels $\alpha_1$ and $\alpha_2$ for the two sample sizes
and it is required only that the expected significance level be equal to
$\alpha = .005$. Examples are: (a) with probability $\frac{1}{2}$ take $n_1 = 2$
observations and perform the test of significance at level $\alpha_1 = .001$, or
take $n_2 = 16$ observations and perform the test at level $\alpha_2 = .009$;
(b) with probability $\frac{1}{2}$ take $n_1 = 0$ or $n_2 = 18$ observations and
let the respective significance levels be $\alpha_1 = 0$, $\alpha_2 = .01$.

Note. This and related examples were discussed by Kruskal in a seminar held 
at Columbia University in 1954. A more detailed investigation of the phenomenon 
has been undertaken by Cohen (1958). 
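The numerical claims in parts (ii) and (iii) can be checked with a few lines of
code, using the standard fact that the most powerful level-α test of θ = 0
against θ = θ1 > 0 based on n observations from N(θ, 1) has power
Φ(θ1√n − z_{1−α}). This is only a numerical check under that formula, not a
solution of the problem; the function name and the conventions for n = 0 and
α = 0 are assumptions made for the sketch.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def mp_power(n, theta1, alpha):
    """Power of the MP level-alpha test of theta = 0 vs theta = theta1 > 0.

    Conventions assumed here: a level-0 test never rejects (power 0), and with
    n = 0 observations the best one can do is reject with probability alpha."""
    if alpha <= 0.0:
        return 0.0
    if n == 0:
        return alpha
    return STD_NORMAL.cdf(theta1 * math.sqrt(n) - STD_NORMAL.inv_cdf(1.0 - alpha))

theta1 = 0.5
fixed = mp_power(9, theta1, 0.005)
mixed = 0.5 * mp_power(2, theta1, 0.005) + 0.5 * mp_power(16, theta1, 0.005)
scheme_a = 0.5 * mp_power(2, theta1, 0.001) + 0.5 * mp_power(16, theta1, 0.009)
scheme_b = 0.5 * mp_power(0, theta1, 0.0) + 0.5 * mp_power(18, theta1, 0.01)
print(f"fixed n = 9:          {fixed:.4f}")
print(f"n = 2 or 16:          {mixed:.4f}")
print(f"scheme (a) of (iii):  {scheme_a:.4f}")
print(f"scheme (b) of (iii):  {scheme_b:.4f}")
```

All four designs use an expected sample size of 9 and an expected level of .005,
so the comparison of the printed powers is a fair one.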

Problem 3.14 If the sample space $\mathcal{X}$ is Euclidean and $P_0$, $P_1$ have
densities with respect to Lebesgue measure, there exists a nonrandomized most
powerful test for testing $P_0$ against $P_1$ at every significance level
$\alpha$. 12 [This is a consequence of Theorem 3.2.1 and the following lemma. 13
Let $f \ge 0$ and $\int_A f(x)\, dx = a$. Given any $0 \le b \le a$, there exists a
subset $B$ of $A$ such that $\int_B f(x)\, dx = b$.]


12 For more general results concerning the possibility of dispensing with randomized 
procedures, see Dvoretzky, Wald, and Wolfowitz (1951). 

13 For a proof of this lemma see Halmos (1974, p. 174.) The lemma is a special case of 
a theorem of Lyapounov (1940); see Blackwell(1951). 
Problem 3.15 Fully informative statistics. A statistic $T$ is fully informative
if for every decision problem the decision procedures based only on $T$ form an
essentially complete class. If $\mathcal{P}$ is dominated and $T$ is fully
informative, then $T$ is sufficient. [Consider any pair of distributions
$P_0, P_1 \in \mathcal{P}$ with densities $p_0, p_1$, and let
$g_i = p_i/(p_0 + p_1)$. Suppose that $T$ is fully informative, and let
$\mathcal{A}_T$ be the subfield induced by $T$. Then $\mathcal{A}_T$ contains the
subfield induced by $(g_0, g_1)$ since it contains every rejection region which
is unique most powerful for testing $P_0$ against $P_1$ (or $P_1$ against $P_0$)
at some level $\alpha$. Therefore, $T$ is sufficient for every pair of
distributions $(P_0, P_1)$, and hence by Problem 2.11 it is sufficient for
$\mathcal{P}$.]

Problem 3.16 Based on $X$ with distribution indexed by $\theta \in \Omega$, the
problem is to test $\theta \in \omega$ versus $\theta \in \omega'$. Suppose there
exists a test $\phi$ such that $E_\theta[\phi(X)] \le \beta$ for all $\theta$ in
$\omega$, where $\beta < \alpha$. Show there exists a level $\alpha$ test
$\phi^*(X)$ such that

\[
E_\theta[\phi(X)] \le E_\theta[\phi^*(X)],
\]

for all $\theta$ in $\omega'$ and this inequality is strict if
$E_\theta[\phi(X)] < 1$.

Problem 3.17 A counterexample. Typically, as $\alpha$ varies the most powerful
level $\alpha$ tests for testing a hypothesis $H$ against a simple alternative are
nested in the sense that the associated rejection regions, say $R_\alpha$, satisfy
$R_\alpha \subset R_{\alpha'}$, for any $\alpha < \alpha'$. Even if the most
powerful tests are nonrandomized, this may be false. Suppose $X$ takes values 1,
2, and 3 with probabilities 0.85, 0.1, and 0.05 under $H$ and probabilities 0.7,
0.2, and 0.1 under $K$.

(i) At any level $< .15$, the MP test is not unique.

(ii) At $\alpha = .05$ and $\alpha' = .1$, there exist unique nonrandomized MP
tests and they are not nested.

(iii) At these levels there exist MP tests $\phi$ and $\phi'$ that are nested in
the sense that $\phi(x) \le \phi'(x)$ for all $x$. [This example appears as
Example 10.16 in Romano and Siegel (1986).]


Problem 3.18 Under the setup of Theorem 3.2.1, show there always exist MP
tests that are nested in the sense of Problem 3.17(iii).


Problem 3.19 Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\xi, \sigma^2)$ with
$\sigma$ known. For testing $\xi = 0$ versus $\xi \ne 0$, the average power of a
test $\phi = \phi(X_1, \ldots, X_n)$ is given by

\[
\int_{-\infty}^{\infty} E_\xi(\phi)\, d\Lambda(\xi),
\]

where $\Lambda$ is a probability distribution on the real line. Suppose that
$\Lambda$ is symmetric about 0; that is, $\Lambda\{E\} = \Lambda\{-E\}$ for all
Borel sets $E$. Show that, among level $\alpha$ tests, the one maximizing average
power rejects for large values of $|\sum X_i|$. Show that this test need not
maximize average power if $\Lambda$ is not symmetric.


Problem 3.20 Let $f_\theta$, $\theta \in \Omega$, denote a family of densities
with respect to a measure $\mu$. (We assume $\Omega$ is endowed with a
$\sigma$-field so that the densities $f_\theta(x)$ are jointly measurable in
$\theta$ and $x$.) Consider the problem of testing a simple null hypothesis
$\theta = \theta_0$ against the composite alternatives
$\Omega_K = \{\theta : \theta \ne \theta_0\}$. Let $\Lambda$ be a probability
distribution on $\Omega_K$.

(i) As explicitly as possible, find a test $\phi$ that maximizes
$\int E_\theta(\phi)\, d\Lambda(\theta)$, subject to it being level $\alpha$.

(ii) Let $h(x) = \int f_\theta(x)\, d\Lambda(\theta)$. Consider the nonrandomized
test $\phi$ that rejects if and only if $h(x)/f_{\theta_0}(x) > k$, and suppose
$\mu\{x : h(x) = k f_{\theta_0}(x)\} = 0$. Then, $\phi$ is admissible at level
$\alpha = E_{\theta_0}(\phi)$ in the sense that it is impossible that there exists
another level $\alpha$ test $\phi'$ such that
$E_\theta(\phi') \ge E_\theta(\phi)$ for all $\theta$.

(iii) Show that the test of Problem 3.19 is admissible.


Section 3.3 

Problem 3.21 In Example 3.21, show that the p-value is indeed given by
$\hat{p} = \hat{p}(X) = (11 - X)/10$. Also, graph the c.d.f. of $\hat{p}$ under
$H$ and show that the last inequality in (3.15) is an equality if and only if $u$
is of the form $i/10$ for some integer $i = 0, \ldots, 10$.

Problem 3.22 Suppose $X$ has a continuous distribution function $F$. Show that
$F(X)$ is uniformly distributed on $(0, 1)$. [The transformation from $X$ to
$F(X)$ is known as the probability integral transformation.]


Problem 3.23 Under the setup of Lemma 3.3.1, suppose the rejection regions
are defined by

\[
S_\alpha = \{X : T(X) \ge k(\alpha)\} \tag{3.45}
\]

for some real-valued statistic $T(X)$ and $k(\alpha)$ satisfying

\[
\sup_{\theta \in \Omega_H} P_\theta\{T(X) \ge k(\alpha)\} \le \alpha.
\]

Then, show

\[
\hat{p} = \sup_{\theta \in \Omega_H} P_\theta\{T(X) \ge t\},
\]

where $t$ is the observed value of $T(X)$.


Problem 3.24 Under the setup of Lemma 3.3.1, show that there exists a
real-valued statistic $T(X)$ so that the rejection region is necessarily of the
form (3.45). [Hint: Let $T(X) = -\hat{p}$.]

Problem 3.25 (i) If $\hat{p}$ is uniform on $(0, 1)$, show that
$-2\log(\hat{p})$ has the Chi-squared distribution with 2 degrees of freedom.

(ii) Suppose $\hat{p}_1, \ldots, \hat{p}_s$ are i.i.d. uniform on $(0, 1)$. Let
$F = -2\log(\hat{p}_1 \cdots \hat{p}_s)$. Argue that $F$ has the Chi-squared
distribution with $2s$ degrees of freedom. What can you say about $F$ if the
$\hat{p}_i$ are independent and satisfy $P\{\hat{p}_i \le u\} \le u$ for all
$0 \le u \le 1$? [Fisher (1934a) proposed $F$ as a means of combining p-values
from independent experiments.]
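For readers who want to experiment with Fisher's statistic, the following sketch
computes F and refers it to the chi-squared distribution with 2s degrees of
freedom, whose upper tail has an elementary closed form because the degrees of
freedom are even. The function names and the example p-values are assumptions
made for this illustration; the derivations asked for in the problem are not
replaced by it.

```python
import math

def chi2_sf_even(x, df):
    """P{chi-squared with df degrees of freedom >= x} for positive even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def fisher_combined(p_values):
    """Return (F, combined p-value), F = -2 log(p_1 ... p_s), referred to chi2 with 2s df."""
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return stat, chi2_sf_even(stat, 2 * len(p_values))

# Hypothetical p-values from three independent experiments.
stat, combined = fisher_combined([0.08, 0.12, 0.30])
print(f"F = {stat:.4f}, combined p-value = {combined:.4f}")

# With a single p-value the combination returns that p-value, consistent with part (i).
_, single = fisher_combined([0.3])
```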


Section 3.4

Problem 3.26 Let $X$ be the number of successes in $n$ independent trials with
probability $p$ of success, and let $\phi(x)$ be the UMP test (3.16) for testing
$p \le p_0$ against $p > p_0$ at level of significance $\alpha$.

(i) For $n = 6$, $p_0 = .25$ and the levels $\alpha = .05, .1, .2$ determine $C$
and $\gamma$, and the power of the test against $p_1 = .3, .4, .5, .6, .7$.

(ii) If $p_0 = .2$ and $\alpha = .05$, and it is desired to have power
$\beta \ge .9$ against $p_1 = .4$, determine the necessary sample size (a) by
using tables of the binomial distribution, (b) by using the normal
approximation. 14

(iii) Use the normal approximation to determine the sample size required when
$\alpha = .05$, $\beta = .9$, $p_0 = .01$, $p_1 = .02$.

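Problem 3.27 (i) A necessary and sufficient condition for densities
$p_\theta(x)$ to have monotone likelihood ratio in $x$, if the mixed second
derivative $\partial^2 \log p_\theta(x)/\partial\theta\,\partial x$ exists, is
that this derivative is $\ge 0$ for all $\theta$ and $x$.

(ii) An equivalent condition is that

\[
p_\theta(x)\, \frac{\partial^2 p_\theta(x)}{\partial\theta\,\partial x}
\ge \frac{\partial p_\theta(x)}{\partial\theta}\,
    \frac{\partial p_\theta(x)}{\partial x}
\]

for all $\theta$ and $x$.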


Problem 3.28 Let the probability density $p_\theta$ of $X$ have monotone
likelihood ratio in $T(x)$, and consider the problem of testing
$H : \theta \le \theta_0$ against $\theta > \theta_0$. If the distribution of $T$
is continuous, the p-value $\hat{p}$ of the UMP test is given by
$\hat{p} = P_{\theta_0}\{T \ge t\}$, where $t$ is the observed value of $T$. This
holds also without the assumption of continuity if for randomized tests
$\hat{p}$ is defined as the smallest significance level at which the hypothesis
is rejected with probability 1. Show that, for any $\theta \le \theta_0$,
$P_\theta\{\hat{p} \le u\} \le u$ for any $0 \le u \le 1$.


Problem 3.29 Let $X_1, \ldots, X_n$ be independently distributed with density
$(2\theta)^{-1} e^{-x/2\theta}$, $x \ge 0$, and let $Y_1 \le \cdots \le Y_n$ be
the ordered $X$'s. Assume that $Y_1$ becomes available first, then $Y_2$, and so
on, and that observation is continued until $Y_r$ has been observed. On the basis
of $Y_1, \ldots, Y_r$ it is desired to test $H : \theta \ge \theta_0 = 1000$ at
level $\alpha = .05$ against $\theta < \theta_0$.

(i) Determine the rejection region when $r = 4$, and find the power of the test
against $\theta_1 = 500$.

(ii) Find the value of $r$ required to get power $\beta \ge .95$ against the
alternative.

[In Problem 2.15, the distribution of
$[\sum_{i=1}^r Y_i + (n - r)Y_r]/\theta$ was found to be $\chi^2$ with $2r$
degrees of freedom.]


Problem 3.30 When a Poisson process with rate $\lambda$ is observed for a time
interval of length $\tau$, the number $X$ of events occurring has the Poisson
distribution $P(\lambda\tau)$. Under an alternative scheme, the process is
observed until $r$ events have occurred, and the time $T$ of observation is then
a random variable such that $2\lambda T$ has a $\chi^2$-distribution with $2r$
degrees of freedom. For testing $H : \lambda \le \lambda_0$ at level $\alpha$ one
can, under either design, obtain a specified power $\beta$ against an alternative
$\lambda_1$ by choosing $\tau$ and $r$ sufficiently large.


14 Tables and approximations are discussed, for example, in Chapter 3 of Johnson and 
Kotz (1969). 
(i) The ratio of the time of observation required for this purpose under the
first design to the expected time required under the second is $\lambda\tau/r$.

(ii) Determine for which values of $\lambda$ each of the two designs is
preferable when $\lambda_0 = 1$, $\lambda_1 = 2$, $\alpha = .05$, $\beta = .9$.

Problem 3.31 Let $X = (X_1, \ldots, X_n)$ be a sample from the uniform
distribution $U(\theta, \theta + 1)$.

(i) For testing $H : \theta \le \theta_0$ against $K : \theta > \theta_0$ at level
$\alpha$ there exists a UMP test which rejects when
$\min(X_1, \ldots, X_n) > \theta_0 + C(\alpha)$ or
$\max(X_1, \ldots, X_n) > \theta_0 + 1$ for suitable $C(\alpha)$.

(ii) The family $U(\theta, \theta + 1)$ does not have monotone likelihood ratio.
[Additional results for this family are given in Birnbaum (1954b) and Pratt
(1958).]

[(ii) By Theorem 3.4.1, monotone likelihood ratio implies that the family of
UMP tests of $H : \theta \le \theta_0$ against $K : \theta > \theta_0$ generated
as $\alpha$ varies from 0 to 1 is independent of $\theta_0$.]

Problem 3.32 Let $X$ be a single observation from the Cauchy density given at
the end of Section 3.4.

(i) Show that no UMP test exists for testing $\theta = 0$ against $\theta > 0$.

(ii) Determine the totality of different shapes the MP level-$\alpha$ rejection
region for testing $\theta = \theta_0$ against $\theta = \theta_1$ can take on for
varying $\alpha$ and $\theta_1 - \theta_0$.

Problem 3.33 Let $X_i$ be independently distributed as $N(i\Delta, 1)$,
$i = 1, \ldots, n$. Show that there exists a UMP test of $H : \Delta \le 0$
against $K : \Delta > 0$, and determine it as explicitly as possible.

Note. The following problems (and some of the Additional Problems in later
chapters) refer to the gamma, Pareto, Weibull, and inverse Gaussian
distributions. For more information about these distributions, see Chapters 17,
19, 20, and 25 respectively of Johnson and Kotz (1970).


Problem 3.34 Let $X_1, \ldots, X_n$ be a sample from the gamma distribution
$\Gamma(g, b)$ with density

\[
\frac{1}{\Gamma(g) b^g}\, x^{g-1} e^{-x/b}, \qquad 0 < x, \quad 0 < b, g.
\]

Show that there exists a UMP test for testing

(i) $H : b \le b_0$ against $b > b_0$ when $g$ is known;

(ii) $H : g \le g_0$ against $g > g_0$ when $b$ is known.

In each case give the form of the rejection region.


Problem 3.35 A random variable $X$ has the Weibull distribution $W(b, c)$ if its
density is

\[
\frac{c}{b}\left(\frac{x}{b}\right)^{c-1} e^{-(x/b)^c}, \qquad x > 0, \quad b, c > 0.
\]

(i) Show that this defines a probability density.

(ii) If $X_1, \ldots, X_n$ is a sample from $W(b, c)$, with the shape parameter
$c$ known, show that there exists a UMP test of $H : b \le b_0$ against $b > b_0$
and give its form.

Problem 3.36 Consider a single observation $X$ from $W(1, c)$.

(i) The family of distributions does not have monotone likelihood ratio in $x$.

(ii) The most powerful test of $H : c = 1$ against $c = 2$ rejects when
$X < k_1$ and when $X > k_2$. Show how to determine $k_1$ and $k_2$.

(iii) Generalize (ii) to arbitrary alternatives $c_1 > 1$, and show that a UMP
test of $H : c = 1$ against $c > 1$ does not exist.

(iv) For any $c_1 > 1$, the power function of the MP test of $H : c = 1$ against
$c = c_1$ is an increasing function of $c$.

Problem 3.37 Let $X_1, \ldots, X_n$ be a sample from the inverse Gaussian distribution
$I(\mu, \tau)$ with density

$$\sqrt{\frac{\tau}{2\pi x^{3}}}\, \exp\left(-\frac{\tau}{2x\mu^{2}} (x - \mu)^{2}\right), \qquad x > 0, \quad \tau, \mu > 0.$$

Show that there exists a UMP test for testing

(i) $H : \mu \le \mu_0$ against $\mu > \mu_0$ when $\tau$ is known;

(ii) $H : \tau \le \tau_0$ against $\tau > \tau_0$ when $\mu$ is known.

In each case give the form of the rejection region.

(iii) The distribution of $V = \tau (X_i - \mu)^2 / (X_i \mu^2)$ is $\chi^2_1$ and hence that of
$\tau \sum [(X_i - \mu)^2 / (X_i \mu^2)]$ is $\chi^2_n$.

[Let $Y = \min(X_i, \mu^2 / X_i)$, $Z = \tau (Y - \mu)^2 / (\mu^2 Y)$. Then $Z = V$ and $Z$ is $\chi^2_1$
[Shuster (1968)].]

Note. The UMP test for (ii) is discussed in Chhikara and Folks (1976).

Problem 3.38 Let $X_1, \ldots, X_n$ be a sample from a location family with common
density $f(x - \theta)$, where the location parameter $\theta \in \mathbb{R}$ and $f(\cdot)$ is known. Consider
testing the null hypothesis that $\theta = \theta_0$ versus an alternative $\theta = \theta_1$ for some $\theta_1 > \theta_0$.
Suppose there exists a most powerful level $\alpha$ test of the form: reject the null
hypothesis iff $T = T(X_1, \ldots, X_n) > C$, where $C$ is a constant and $T(X_1, \ldots, X_n)$
is location equivariant, i.e. $T(X_1 + c, \ldots, X_n + c) = T(X_1, \ldots, X_n) + c$ for all
constants $c$. Is the test also most powerful level $\alpha$ for testing the null hypothesis
$\theta \le \theta_0$ against the alternative $\theta = \theta_1$? Prove or give a counterexample.

Problem 3.39 Extension of Lemma 3.4.2. Let $P_0$ and $P_1$ be two distributions
with densities $p_0, p_1$ such that $p_1(x)/p_0(x)$ is a nondecreasing function of a
real-valued statistic $T(x)$.

(i) If $T$ has probability density $p'_i$ when the original distribution is $P_i$, then
$p'_1(t)/p'_0(t)$ is nondecreasing in $t$.

(ii) $E_0 \psi(T) \le E_1 \psi(T)$ for any nondecreasing function $\psi$.

(iii) If $p_1(x)/p_0(x)$ is a strictly increasing function of $t = T(x)$, so is
$p'_1(t)/p'_0(t)$, and $E_0 \psi(T) < E_1 \psi(T)$ unless $\psi[T(x)]$ is constant a.e. $(P_0 + P_1)$
or $E_0 \psi(T) = E_1 \psi(T) = \pm\infty$.

(iv) For any distinct distributions with densities $p_0, p_1$,

$$-\infty \le E_0 \log\left[\frac{p_1(X)}{p_0(X)}\right] < E_1 \log\left[\frac{p_1(X)}{p_0(X)}\right] \le \infty.$$

[(i): Without loss of generality suppose that $p_1(x)/p_0(x) = T(x)$. Then for
any integrable $\phi$,

$$\int \phi(t)\, p'_1(t)\, d\nu(t) = \int \phi[T(x)]\, T(x)\, p_0(x)\, d\mu(x) = \int \phi(t)\, t\, p'_0(t)\, d\nu(t),$$

and hence $p'_1(t)/p'_0(t) = t$ a.e.

(iv): The possibility $E_0 \log[p_1(X)/p_0(X)] = \infty$ is excluded, since by the
concavity of the function log (Jensen's inequality),

$$E_0 \log\left[\frac{p_1(X)}{p_0(X)}\right] < \log E_0 \left[\frac{p_1(X)}{p_0(X)}\right] = 0.$$

Similarly for $E_1$. The strict inequality now follows from (iii) with $T(x) = p_1(x)/p_0(x)$.]


Problem 3.40 If $F_0, F_1$ are two cumulative distribution functions on the real
line, then $F_1(x) \le F_0(x)$ for all $x$ if and only if $E_0 \psi(X) \le E_1 \psi(X)$ for any
nondecreasing function $\psi$.


Problem 3.41 Let $F$ and $G$ be two continuous, strictly increasing c.d.f.s, and
let $k(u) = G[F^{-1}(u)]$, $0 < u < 1$.

(i) Show $F$ and $G$ are stochastically ordered, say $G(x) \le F(x)$ for all $x$, if and
only if $k(u) \le u$ for all $0 < u < 1$.

(ii) If $F$ and $G$ have densities $f$ and $g$, then show they are monotone likelihood
ratio ordered, say $g/f$ nondecreasing, if and only if $k$ is convex.

(iii) Use (i) and (ii) to give an alternative proof of the fact that MLR implies 
stochastic ordering. 


Problem 3.42 Let $f(x)/[1 - F(x)]$ be the “mortality” of a subject at time $x$
given that it has survived to this time. A c.d.f. $F$ is said to be smaller than $G$ in
the hazard ordering if

$$\frac{g(x)}{1 - G(x)} \le \frac{f(x)}{1 - F(x)} \quad \text{for all } x. \tag{3.46}$$

(i) Show that (3.46) is equivalent to

$$\frac{1 - F(x)}{1 - G(x)} \quad \text{is nonincreasing.} \tag{3.47}$$

(ii) Show that (3.46) holds if and only if $k$ is starshaped. [A function $k$ defined
on an interval $I \subset [0, \infty)$ is starshaped on $I$ if $k(\lambda x) \le \lambda k(x)$ whenever $x \in I$,
$\lambda x \in I$, $0 \le \lambda \le 1$. Problems 3.41 and 3.42 are based on Lehmann and Rojo
(1992).]

Section 3.5

Problem 3.43 (i) For $n = 5, 10$ and $1 - \alpha = .95$, graph the upper confidence
limits $\bar{p}$ and $\bar{p}^*$ of Example 3.5.2 as functions of $t = x + u$.

(ii) For the same values of $n$ and $\alpha_1 = \alpha_2 = .05$, graph the lower and upper
confidence limits $\underline{p}$ and $\bar{p}$.

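Problem 3.44 Confidence bounds with minimum risk. Let $L(\theta, \underline{\theta})$ be nonnegative
and nonincreasing in its second argument for $\underline{\theta} < \theta$, and equal to 0 for $\underline{\theta} \ge \theta$.
If $\underline{\theta}$ and $\underline{\theta}^*$ are two lower confidence bounds for $\theta$ such that

$$P_\theta\{\underline{\theta} \le \theta'\} \le P_\theta\{\underline{\theta}^* \le \theta'\} \quad \text{for all } \theta' \le \theta,$$

then

$$E_\theta L(\theta, \underline{\theta}) \le E_\theta L(\theta, \underline{\theta}^*).$$

[Define two cumulative distribution functions $F$ and $F^*$ by $F(u) = P_\theta\{\underline{\theta} \le u\} / P_\theta\{\underline{\theta}^* \le \theta\}$,
$F^*(u) = P_\theta\{\underline{\theta}^* \le u\} / P_\theta\{\underline{\theta}^* \le \theta\}$ for $u < \theta$, $F(u) = F^*(u) = 1$
for $u \ge \theta$. Then $F(u) \le F^*(u)$ for all $u$, and it follows from Problem 3.40 that

$$E_\theta[L(\theta, \underline{\theta})] = P_\theta\{\underline{\theta}^* \le \theta\} \int L(\theta, u)\, dF(u) \le P_\theta\{\underline{\theta}^* \le \theta\} \int L(\theta, u)\, dF^*(u) = E_\theta[L(\theta, \underline{\theta}^*)].]$$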


Section 3.6 

Problem 3.45 If $\beta(\theta)$ denotes the power function of the UMP test of Corollary
3.4.1, and if the function $Q$ of (3.19) is differentiable, then $\beta'(\theta) > 0$ for all $\theta$ for
which $Q'(\theta) > 0$.

[To show that $\beta'(\theta_0) > 0$, consider the problem of maximizing, subject to
$E_{\theta_0} \phi(X) = \alpha$, the derivative $\beta'(\theta_0)$ or equivalently the quantity $E_{\theta_0}[T(X) \phi(X)]$.]

Problem 3.46 Optimum selection procedures. On each member of a population
$n$ measurements $(X_1, \ldots, X_n) = X$ are taken, for example the scores of $n$ aptitude
tests which are administered to judge the qualifications of candidates for a certain
training program. A future measurement $Y$ such as the score in a final test at
the end of the program is of interest but unavailable. The joint distribution of $X$
and $Y$ is assumed known.

(i) One wishes to select a given proportion $\alpha$ of the candidates in such a way
as to maximize the expectation of $Y$ for the selected group. This is achieved
by selecting the candidates for which $E(Y \mid x) > C$, where $C$ is determined
by the condition that the probability of a member being selected is $\alpha$.
When $E(Y \mid x) = C$, it may be necessary to randomize in order to get the
exact value $\alpha$.

(ii) If instead the problem is to maximize the probability with which in the
selected population $Y$ is greater than or equal to some preassigned score
$y_0$, one selects the candidates for which the conditional probability
$P\{Y \ge y_0 \mid x\}$ is sufficiently large.

[(i): Let $\phi(x)$ denote the probability with which a candidate with measurements
$x$ is to be selected. Then the problem is that of maximizing

$$\int \left[ \int y\, p^{Y \mid x}(y)\, \phi(x)\, dy \right] p^{X}(x)\, dx$$

subject to

$$\int \phi(x)\, p^{X}(x)\, dx = \alpha.]$$


Problem 3.47 The following example shows that Corollary 3.6.1 does not extend
to a countably infinite family of distributions. Let $p_n$ be the uniform
probability density on $[0, 1 + 1/n]$, and $p_0$ the uniform density on $(0, 1)$.

(i) Then $p_0$ is linearly independent of $(p_1, p_2, \ldots)$, that is, there do not exist
constants $c_1, c_2, \ldots$ such that $p_0 = \sum c_n p_n$.

(ii) There does not exist a test $\phi$ such that $\int \phi p_n = \alpha$ for $n = 1, 2, \ldots$ but
$\int \phi p_0 > \alpha$.


Problem 3.48 Let $F_1, \ldots, F_{m+1}$ be real-valued functions defined over a space
$U$. A sufficient condition for $u_0$ to maximize $F_{m+1}$ subject to $F_i(u) \le c_i$ $(i = 1, \ldots, m)$
is that it satisfies these side conditions, that it maximizes
$F_{m+1}(u) - \sum k_i F_i(u)$ for some constants $k_i \ge 0$, and that $F_i(u_0) = c_i$ for those values $i$ for
which $k_i > 0$.


Section 3.7 

Problem 3.49 For a random variable $X$ with binomial distribution $b(p, n)$, determine
the constants $C_i, \gamma_i$ $(i = 1, 2)$ in the UMP test (3.31) for testing $H : p \le .2$
or $\ge .7$ when $\alpha = .1$ and $n = 15$. Find the power of the test against the alternative
$p = .4$.


Problem 3.50 Totally positive families. A family of distributions with probability
densities $p_\theta(x)$, $\theta$ and $x$ real-valued and varying over $\Omega$ and $\mathcal{X}$ respectively,
is said to be totally positive of order $r$ ($TP_r$) if for all $x_1 < \cdots < x_n$ and
$\theta_1 < \cdots < \theta_n$

$$\Delta_n = \begin{vmatrix} p_{\theta_1}(x_1) & \cdots & p_{\theta_1}(x_n) \\ \vdots & & \vdots \\ p_{\theta_n}(x_1) & \cdots & p_{\theta_n}(x_n) \end{vmatrix} \ge 0 \quad \text{for all } n = 1, 2, \ldots, r. \tag{3.48}$$

It is said to be strictly totally positive of order $r$ ($STP_r$) if strict inequality
holds in (3.48). The family is said to be (strictly) totally positive of order infinity if
(3.48) holds for all $n = 1, 2, \ldots$. These definitions apply not only to probability
densities but to any real-valued functions $p_\theta(x)$ of two real variables.


(i) For $r = 1$, (3.48) states that $p_\theta(x) \ge 0$; for $r = 2$, that $p_\theta(x)$ has monotone
likelihood ratio in $x$.

(ii) If $a(\theta) > 0$, $b(x) > 0$, and $p_\theta(x)$ is $STP_r$ then so is $a(\theta) b(x) p_\theta(x)$.

(iii) If $a$ and $b$ are real-valued functions mapping $\Omega$ and $\mathcal{X}$ onto $\Omega'$ and $\mathcal{X}'$ and
are strictly monotone in the same direction, and if $p_\theta(x)$ is $STP_r$, then
$p_{\theta'}(x')$ with $\theta' = a^{-1}(\theta)$ and $x' = b^{-1}(x)$ is $STP_r$ over $(\Omega', \mathcal{X}')$.

Problem 3.51 Exponential families. The exponential family (3.19) with $T(x) = x$
and $Q(\theta) = \theta$ is $STP_\infty$, with $\Omega$ the natural parameter space and $\mathcal{X} = (-\infty, \infty)$.

[That the determinant $|e^{\theta_i x_j}|$, $i, j = 1, \ldots, n$, is positive can be proved by
induction. Divide the $i$th column by $e^{\theta_1 x_i}$, $i = 1, \ldots, n$; subtract in the resulting
determinant the $(n-1)$st column from the $n$th, the $(n-2)$nd from the $(n-1)$st,
..., the 1st from the 2nd; and expand the determinant obtained in this way by
the first row. Then $\Delta_n$ is seen to have the same sign as

$$\Delta'_n = |e^{\eta_i x_j} - e^{\eta_i x_{j-1}}|, \quad i, j = 2, \ldots, n,$$

where $\eta_i = \theta_i - \theta_1$. If this determinant is expanded by the first column one obtains
a sum of the form

$$a_2 (e^{\eta_2 x_2} - e^{\eta_2 x_1}) + \cdots + a_n (e^{\eta_n x_2} - e^{\eta_n x_1}) = h(x_2) - h(x_1) = (x_2 - x_1) h'(y_2),$$

where $x_1 < y_2 < x_2$. Rewriting $h'(y_2)$ as a determinant of which all columns but
the first coincide with those of $\Delta'_n$ and proceeding in the same manner with the
columns, one reduces the determinant to $|e^{\eta_i y_j}|$, $i, j = 2, \ldots, n$, which is positive
by the induction hypothesis.]

Problem 3.52 $STP_3$. Let $\theta$ and $x$ be real-valued, and suppose that the probability
densities $p_\theta(x)$ are such that $p_{\theta'}(x)/p_\theta(x)$ is strictly increasing in $x$ for
$\theta < \theta'$. Then the following two conditions are equivalent:

(a) For $\theta_1 < \theta_2 < \theta_3$ and $k_1, k_2, k_3 > 0$, let

$$g(x) = k_1 p_{\theta_1}(x) - k_2 p_{\theta_2}(x) + k_3 p_{\theta_3}(x).$$

If $g(x_1) = g(x_3) = 0$, then the function $g$ is positive outside the interval $(x_1, x_3)$
and negative inside.

(b) The determinant $\Delta_3$ given by (3.48) is positive for all
$\theta_1 < \theta_2 < \theta_3$, $x_1 < x_2 < x_3$.

[It follows from (a) that the equation $g(x) = 0$ has at most two solutions.]

[That (b) implies (a) can be seen for $x_1 < x_2 < x_3$ by considering the
determinant

$$\begin{vmatrix} g(x_1) & g(x_2) & g(x_3) \\ p_{\theta_2}(x_1) & p_{\theta_2}(x_2) & p_{\theta_2}(x_3) \\ p_{\theta_3}(x_1) & p_{\theta_3}(x_2) & p_{\theta_3}(x_3) \end{vmatrix}.$$

Suppose conversely that (a) holds. Monotonicity of the likelihood ratios implies
that the rank of $\Delta_3$ is at least two, so that there exist constants $k_1, k_2, k_3$ such that
$g(x_1) = g(x_3) = 0$. That the $k$'s are positive follows again from the monotonicity
of the likelihood ratios.]

Problem 3.53 Extension of Theorem 3.7.1. The conclusions of Theorem 3.7.1
remain valid if the density of a sufficient statistic $T$ (which without loss of generality
will be taken to be $X$), say $p_\theta(x)$, is $STP_3$ and is continuous in $x$ for each
$\theta$.

[The two properties of exponential families that are used in the proof of 
Theorem 3.7.1 are continuity in x and (a) of the preceding problem.] 

Problem 3.54 For testing the hypothesis $H' : \theta_1 \le \theta \le \theta_2$ $(\theta_1 \le \theta_2)$ against the
alternatives $\theta < \theta_1$ or $\theta > \theta_2$, or the hypothesis $\theta = \theta_0$ against the alternatives
$\theta \ne \theta_0$, in an exponential family or more generally in a family of distributions
satisfying the assumptions of Problem 3.53, a UMP test does not exist.

[This follows from a consideration of the UMP tests for the one-sided
hypotheses $H_1 : \theta \ge \theta_1$ and $H_2 : \theta \le \theta_2$.]

Problem 3.55 Let $f, g$ be two probability densities with respect to $\mu$. For testing
the hypothesis $H : \theta \le \theta_0$ or $\theta \ge \theta_1$ $(0 < \theta_0 < \theta_1 < 1)$ against the alternatives
$\theta_0 < \theta < \theta_1$, in the family $\mathcal{P} = \{\theta f(x) + (1 - \theta) g(x),\ 0 \le \theta \le 1\}$, the test $\phi(x) \equiv \alpha$
is UMP at level $\alpha$.


Section 3.8 

Problem 3.56 Let the variables $X_i$ $(i = 1, \ldots, s)$ be independently distributed
with Poisson distribution $P(\lambda_i)$. For testing the hypothesis $H : \sum \lambda_j \le a$ (for
example, that the combined radioactivity of a number of pieces of radioactive
material does not exceed $a$), there exists a UMP test, which rejects when $\sum X_j > C$.

[If the joint distribution of the $X$'s is factored into the marginal distribution of
$\sum X_j$ (Poisson with mean $\sum \lambda_j$) times the conditional distribution of the variables
$Y_i = X_i / \sum X_j$ given $\sum X_j$ (multinomial with probabilities $p_i = \lambda_i / \sum \lambda_j$),
the argument is analogous to that given in Example 3.8.1.]


Problem 3.57 Confidence bounds for a median. Let $X_1, \ldots, X_n$ be a sample
from a continuous cumulative distribution function $F$. Let $\xi$ be the unique
median of $F$ if it exists, or more generally let $\xi = \inf\{\xi' : F(\xi') = \tfrac{1}{2}\}$.

(i) If the ordered $X$'s are $X_{(1)} < \cdots < X_{(n)}$, a uniformly most accurate lower
confidence bound for $\xi$ is $\underline{\xi} = X_{(k)}$ with probability $\rho$, $\underline{\xi} = X_{(k+1)}$ with
probability $1 - \rho$, where $k$ and $\rho$ are determined by

$$\rho \sum_{j=k}^{n} \binom{n}{j} \frac{1}{2^n} + (1 - \rho) \sum_{j=k+1}^{n} \binom{n}{j} \frac{1}{2^n} = 1 - \alpha.$$

(ii) This bound has confidence coefficient $1 - \alpha$ for any median of $F$.

(iii) Determine most accurate lower confidence bounds for the 100$p$-percentile
$\xi$ of $F$ defined by $\xi = \inf\{\xi' : F(\xi') = p\}$.

[For fixed $\xi_0$ the problem of testing $H : \xi = \xi_0$ against $K : \xi > \xi_0$ is equivalent
to testing $H' : p = \tfrac{1}{2}$ against $K' : p < \tfrac{1}{2}$.]


Problem 3.58 A counterexample. Typically, as $\alpha$ varies the most powerful level
$\alpha$ tests for testing a hypothesis $H$ against a simple alternative are nested in the
sense that the associated rejection regions, say $R_\alpha$, satisfy $R_\alpha \subset R_{\alpha'}$ for any $\alpha < \alpha'$.
The following example shows that this need not be satisfied for composite $H$.
Let $X$ take on the values 1, 2, 3, 4 with probabilities under distributions $P_0$, $P_1$, $Q$:

            1       2       3       4
    P_0   2/13    4/13    3/13    4/13
    P_1   4/13    2/13    1/13    6/13
    Q     4/13    3/13    2/13    4/13

Then the most powerful test for testing the hypothesis that the distribution of
$X$ is $P_0$ or $P_1$ against the alternative that it is $Q$ rejects at level $\alpha = \tfrac{5}{13}$ when
$X = 1$ or 3, and at level $\alpha = \tfrac{6}{13}$ when $X = 1$ or 2.
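The arithmetic behind this counterexample can be checked with a short enumeration. The sketch below is only a partial check, under stated assumptions: it reads the table as $P_0 = (2, 4, 3, 4)/13$, $P_1 = (4, 2, 1, 6)/13$, $Q = (4, 3, 2, 4)/13$, and it searches nonrandomized rejection regions only, so it does not by itself prove optimality among all (randomized) tests. The function names are illustrative and not from the text.

```python
from fractions import Fraction
from itertools import combinations

# Probabilities of X = 1, 2, 3, 4, as read from the table (assumption of this sketch).
P0 = [Fraction(k, 13) for k in (2, 4, 3, 4)]
P1 = [Fraction(k, 13) for k in (4, 2, 1, 6)]
Q = [Fraction(k, 13) for k in (4, 3, 2, 4)]


def prob(dist, region):
    """Probability that X falls in `region`, a tuple of values from 1..4."""
    return sum((dist[x - 1] for x in region), Fraction(0))


def best_nonrandomized_region(alpha):
    """Region with size <= alpha under both P0 and P1 and maximal power under Q.

    Only nonrandomized regions are searched.
    """
    best, best_power = (), Fraction(0)
    for size in range(1, 5):
        for region in combinations((1, 2, 3, 4), size):
            if max(prob(P0, region), prob(P1, region)) <= alpha:
                power = prob(Q, region)
                if power > best_power:
                    best, best_power = region, power
    return best, best_power


r_small, pow_small = best_nonrandomized_region(Fraction(5, 13))
r_large, pow_large = best_nonrandomized_region(Fraction(6, 13))
print(r_small, pow_small)  # region and power at level 5/13
print(r_large, pow_large)  # region and power at level 6/13
```

Within this restricted class the region found at the smaller level is not contained in the region found at the larger level, which is the failure of nesting the problem describes.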


Problem 3.59 Let $X$ and $Y$ be the number of successes in two sets of $n$ binomial
trials with probabilities $p_1$ and $p_2$ of success.

(i) The most powerful test of the hypothesis $H : p_2 \le p_1$ against an alternative
$(p'_1, p'_2)$ with $p'_1 < p'_2$ and $p'_1 + p'_2 = 1$ at level $\alpha < \tfrac{1}{2}$ rejects when $Y - X > C$
and with probability $\gamma$ when $Y - X = C$.

(ii) This test is not UMP against the alternatives $p_1 < p_2$.

[(i): Take the distribution $\Lambda$ assigning probability 1 to the point $p_1 = p_2 = \tfrac{1}{2}$
as an a priori distribution over $H$. The most powerful test against $(p'_1, p'_2)$ is then
the one proposed above. To see that $\Lambda$ is least favorable, consider the probability
of rejection $\beta(p_1, p_2)$ for $p_1 = p_2 = p$. By symmetry this is given by

$$2\beta(p, p) = P\{|Y - X| > C\} + \gamma P\{|Y - X| = C\}.$$

Let $X_i$ be 1 or 0 as the $i$th trial in the first series is a success or failure, and
let $Y_i$ be defined analogously with respect to the second series. Then
$Y - X = \sum_{i=1}^{n} (Y_i - X_i)$, and the fact that $2\beta(p, p)$ attains its maximum for $p = \tfrac{1}{2}$ can be
proved by induction over $n$.

(ii): Since $\beta(p, p) < \alpha$ for $p \ne \tfrac{1}{2}$, the power $\beta(p_1, p_2)$ is $< \alpha$ for alternatives
$p_1 < p_2$ sufficiently close to the line $p_1 = p_2$. That the test is not UMP now
follows from a comparison with $\phi(x, y) \equiv \alpha$.]


Problem 3.60 Sufficient statistics with nuisance parameters. 

(i) A statistic $T$ is said to be partially sufficient for $\theta$ in the presence of a
nuisance parameter $\eta$ if the parameter space is the direct product of the
set of possible $\theta$- and $\eta$-values, and if the following two conditions hold: (a)
the conditional distribution given $T = t$ depends only on $\eta$; (b) the marginal
distribution of $T$ depends only on $\theta$. If these conditions are satisfied, there
exists a UMP test for testing the composite hypothesis $H : \theta = \theta_0$ against
the composite class of alternatives $\theta = \theta_1$, which depends only on $T$.

(ii) Part (i) provides an alternative proof that the test of Example 3.8.1 is
UMP.

[Let $\psi_0(t)$ be the most powerful level $\alpha$ test for testing $\theta_0$ against $\theta_1$ that
depends only on $t$, let $\phi(x)$ be any level-$\alpha$ test, and let $\psi(t) = E_{\eta_1}[\phi(X) \mid t]$.
Since $E_{\theta_i} \psi(T) = E_{\theta_i, \eta_1} \phi(X)$, it follows that $\psi$ is a level-$\alpha$ test of $H$ and its
power, and therefore the power of $\phi$, does not exceed the power of $\psi_0$.]

Note. For further discussion of this and related concepts of partial sufficiency
see Fraser (1956), Dawid (1975), Sprott (1975), Basu (1978), and Barndorff-Nielsen
(1978).

Section 3.9

Problem 3.61 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent samples from
$N(\xi, 1)$ and $N(\eta, 1)$, and consider the hypothesis $H : \eta \le \xi$ against $K : \eta > \xi$.
There exists a UMP test, and it rejects the hypothesis when $\bar{Y} - \bar{X}$ is too large.

[If $\xi_1 < \eta_1$ is a particular alternative, the distribution assigning probability 1
to the point $\eta = \xi = (m\xi_1 + n\eta_1)/(m + n)$ is least favorable.]

Problem 3.62 Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be independently, normally distributed
with means $\xi$ and $\eta$, and variances $\sigma^2$ and $\tau^2$ respectively, and consider
the hypothesis $H : \tau \le \sigma$ against $K : \sigma < \tau$.

(i) If $\xi$ and $\eta$ are known, there exists a UMP test given by the rejection region

$$\sum (Y_j - \eta)^2 \Big/ \sum (X_i - \xi)^2 \ge C.$$

(ii) No UMP test exists when $\xi$ and $\eta$ are unknown.

Problem 3.63 Suppose $X$ is a $k \times 1$ random vector with $E(|X|^2) < \infty$ and
covariance matrix $\Sigma$. Let $A$ be an $m \times k$ (nonrandom) matrix and let $Y = AX$.
Show $Y$ has mean vector $AE(X)$ and covariance matrix $A \Sigma A^T$.

Problem 3.64 Suppose $(X_1, \ldots, X_k)$ has the multivariate normal distribution
with unknown mean vector $\xi = (\xi_1, \ldots, \xi_k)$ and known covariance matrix $\Sigma$.
Suppose $X_1$ is independent of $(X_2, \ldots, X_k)$. Show that $X_1$ is partially sufficient
for $\xi_1$ in the sense of Problem 3.60. Provide an alternative argument for Case 2
of Example 3.9.2.

Problem 3.65 In Example 3.9.2, Case 2, verify the claim for the least favorable 
distribution. 

Problem 3.66 In Example 3.9.3, provide the details for Cases 3 and 4. 


3.11 Notes 

Hypothesis testing developed gradually, with early instances frequently being 
rather vague statements of the significance or nonsignificance of a set of observations.
Isolated applications are found in the 18th century [Arbuthnot (1710),
Daniel Bernoulli (1734), and Laplace (1773), for example] and centuries earlier 
in the Royal Mint’s Trial of the Pyx [discussed by Stigler (1977)]. They became 
more frequent in the 19th century in the writings of such authors as Gavarret 
(1840), Lexis (1875, 1877), and Edgeworth (1885). A new stage began with the 
work of Karl Pearson, particularly his $\chi^2$ paper of 1900, followed in the decade
1915-1925 by Fisher’s normal theory and $\chi^2$ tests. Fisher presented this work
systematically in his enormously influential book Statistical Methods for Research
Workers (1925b). 

The first authors to recognize that the rational choice of a test must involve 
consideration not only of the hypothesis but also of the alternatives against which 
it is being tested were Neyman and E. S. Pearson (1928). They introduced the
distinction between errors of the first and second kind, and thereby motivated their
proposal of the likelihood-ratio criterion as a general method of test construction.
These considerations were carried to their logical conclusion by Neyman
and Pearson in their paper of 1933, in which they developed the theory of UMP
tests. Accounts of their collaboration can be found in Pearson’s recollections 
(1966), and in the biography of Neyman by Reid (1982). 

The Neyman-Pearson lemma has been generalized in many directions, including
the results in Sections 3.6, 3.8 and 3.9. Dantzig and Wald (1951) give necessary
conditions, including those of Theorem 3.6.1, for a critical function which maximizes
an integral subject to a number of integral side conditions, to satisfy
(3.28). The role of the Neyman-Pearson lemma in hypothesis testing is surveyed
in Lehmann (1985a). 

An extension to a selection problem, proposed by Birnbaum and Chapman 
(1950), is sketched in Problem 3.46. Further developments in this area are reviewed
in Gibbons (1986, 1988). Grenander (1981) applies the fundamental
lemma to problems in stochastic processes. 

Lemmas 3.4.1, 3.4.2, and 3.7.1 are due to Lehmann (1961). 

Complete class results for simple null hypothesis testing problems are obtained 
in Brown and Marden (1989). 

The earliest example of confidence intervals appears to occur in the work of 
Laplace (1812), who points out how an (approximate) probability statement concerning
the difference between an observed frequency and a binomial probability
$p$ can be inverted to obtain an associated interval for $p$. Other examples can be
found in the work of Gauss (1816), Fourier (1826), and Lexis (1875). However, in
all these cases, although the statements made are formally correct, the authors
appear to consider the parameter as the variable which with the stated probability
falls in the fixed confidence interval. The proper interpretation seems to
have been pointed out for the first time by E. B. Wilson (1927). About the same 
time two examples of exact confidence statements were given by Working and 
Hotelling (1929) and Hotelling (1931). 

A general method for obtaining exact confidence bounds for a real-valued parameter
in a continuous distribution was proposed by Fisher (1930), who however
later disavowed this interpretation of his work. For a discussion of Fisher’s controversial
concept of fiducial probability, see Section 5.7. At about the same time,¹⁵
a completely general theory of confidence statements was developed by Neyman
and shown by him to be intimately related to the theory of hypothesis testing. 
A detailed account of this work, which underlies the treatment given here, was 
published by Neyman in his papers of 1937 and 1938. 

The calculation of p-values was the standard approach to hypothesis testing 
throughout the 19th century and continues to be widely used today. For various
questions of interpretation, extensions, and critiques, see Cox (1977), Berger
and Sellke (1987), Marden (1991), Hwang, Casella, Robert, Wells and Farrell
(1992), Lehmann (1993), Robert (1994), Berger, Brown and Wolpert (1994), 
Meng (1994), Blyth and Staudte (1995, 1997), Liu and Singh (1997), Sackrowitz 
and Samuel-Cahn (1999), Marden (2000), Sellke et al. (2001), and Berger (2003). 

Extensions of p-values to hypotheses with nuisance parameters are discussed by
Berger and Boos (1994) and Bayarri and Berger (2000), and the large-sample
behavior of p-values in Lambert and Hall (1982) and Robins et al. (2000). An
optimality theory in terms of p-values is sketched by Schweder (1988), and p-values
for the simultaneous testing of several hypotheses are treated by Schweder
and Spjøtvoll (1982), Westfall and Young (1993), and by Dudoit et al. (2003).

15 Cf. Neyman (1941b).

An important use of p-values occurs in meta-analysis when one is dealing with
the combination of results from independent experiments. The early literature 
on this topic is reviewed in Hedges and Olkin (1985, Chapter 3). Additional 
references are Marden (1982b, 1985), Scholz (1982) and a review article by Becker 
(1997). Associated confidence intervals are proposed by Littell and Louv (1981). 



4 

Unbiasedness: Theory and First 
Applications 


4.1 Unbiasedness For Hypothesis Testing 

A simple condition that one may wish to impose on tests of the hypothesis $H :
\theta \in \Omega_H$ against the composite class of alternatives $K : \theta \in \Omega_K$ is that for no
alternative in $K$ should the probability of rejection be less than the size of the
test. Unless this condition is satisfied, there will exist alternatives under which
acceptance of the hypothesis is more likely than in some cases in which the
hypothesis is true. A test $\phi$ for which the above condition holds, that is, for
which the power function $\beta_\phi(\theta) = E_\theta \phi(X)$ satisfies

$$\begin{aligned} \beta_\phi(\theta) &\le \alpha \quad \text{if } \theta \in \Omega_H, \\ \beta_\phi(\theta) &\ge \alpha \quad \text{if } \theta \in \Omega_K, \end{aligned} \tag{4.1}$$

is said to be unbiased. For an appropriate loss function this was seen in Chapter
1 to be a particular case of the general definition of unbiasedness given there.
Whenever a UMP test exists, it is unbiased, since its power cannot fall below
that of the test $\phi(x) \equiv \alpha$.
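As a numerical illustration of condition (4.1), the following sketch compares two level-$\alpha$ two-sided tests of $\theta = 0$ based on a single observation $X \sim N(\theta, 1)$. This normal setting, the unequal split of $\alpha$ between the tails, and all names in the code are assumptions introduced here for illustration; they are not taken from the text. The equal-tails test has power at least $\alpha$ everywhere on the grid, while the test with unequal tails has power below $\alpha$ for some alternatives and is therefore biased.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def power(theta, lower_tail, upper_tail):
    """Power at theta of the test rejecting when X < c1 or X > c2, X ~ N(theta, 1).

    c1 and c2 are chosen so that, under theta = 0, the lower and upper tail
    probabilities are `lower_tail` and `upper_tail`.
    """
    c1 = STD_NORMAL.inv_cdf(lower_tail)
    c2 = STD_NORMAL.inv_cdf(1.0 - upper_tail)
    return STD_NORMAL.cdf(c1 - theta) + 1.0 - STD_NORMAL.cdf(c2 - theta)


alpha = 0.05
thetas = [i / 100.0 for i in range(-300, 301)]  # grid of alternatives, includes 0

# Equal tails: alpha/2 in each tail.
equal = [power(t, alpha / 2, alpha / 2) for t in thetas]
# Unequal tails: alpha/4 below, 3*alpha/4 above (still size alpha at theta = 0).
skewed = [power(t, alpha / 4, 3 * alpha / 4) for t in thetas]

min_equal = min(equal)
min_skewed = min(skewed)
print(min_equal, min_skewed)
```

The minimum of the first power function on the grid equals $\alpha$ (up to rounding) and is attained at $\theta = 0$; the minimum of the second falls below $\alpha$, at a slightly negative $\theta$.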

For a large class of problems for which a UMP test does not exist, there does
exist a UMP unbiased test. This is the case in particular for certain hypotheses
of the form $\theta \le \theta_0$ or $\theta = \theta_0$, where the distribution of the random observables
depends on other parameters besides $\theta$.

When $\beta_\phi(\theta)$ is a continuous function of $\theta$, unbiasedness implies

$$\beta_\phi(\theta) = \alpha \quad \text{for all } \theta \text{ in } \omega, \tag{4.2}$$

where $\omega$ is the common boundary of $\Omega_H$ and $\Omega_K$, that is, the set of points $\theta$ that
are points or limit points of both $\Omega_H$ and $\Omega_K$. Tests satisfying this condition are
said to be similar on the boundary (of $H$ and $K$). Since it is more convenient to
work with (4.2) than with (4.1), the following lemma plays an important role in
the determination of UMP unbiased tests.


Lemma 4.1.1 If the distributions $P_\theta$ are such that the power function of every
test is continuous, and if $\phi_0$ is UMP among all tests satisfying (4.2) and is a
level-$\alpha$ test of $H$, then $\phi_0$ is UMP unbiased.

Proof. The class of tests satisfying (4.2) contains the class of unbiased tests,
and hence $\phi_0$ is uniformly at least as powerful as any unbiased test. On the other
hand, $\phi_0$ is unbiased, since it is uniformly at least as powerful as $\phi(x) \equiv \alpha$. ■


4.2 One-Parameter Exponential Families 

Let $\theta$ be a real parameter, and $X = (X_1, \ldots, X_n)$ a random vector with
probability density (with respect to some measure $\mu$)

$$p_\theta(x) = C(\theta) e^{\theta T(x)} h(x).$$

It was seen in Chapter 3 that a UMP test exists when the hypothesis $H$ and the
class $K$ of alternatives are given by (i) $H : \theta \le \theta_0$, $K : \theta > \theta_0$ (Corollary 3.4.1)
and (ii) $H : \theta \le \theta_1$ or $\theta \ge \theta_2$ $(\theta_1 < \theta_2)$, $K : \theta_1 < \theta < \theta_2$ (Theorem 3.7.1), but not
for (iii) $H : \theta_1 \le \theta \le \theta_2$, $K : \theta < \theta_1$ or $\theta > \theta_2$. We shall now show that in case
(iii) there does exist a UMP unbiased test given by

$$\phi(x) = \begin{cases} 1 & \text{when } T(x) < C_1 \text{ or } > C_2, \\ \gamma_i & \text{when } T(x) = C_i,\ i = 1, 2, \\ 0 & \text{when } C_1 < T(x) < C_2, \end{cases} \tag{4.3}$$

where the $C$'s and $\gamma$'s are determined by

$$E_{\theta_1} \phi(X) = E_{\theta_2} \phi(X) = \alpha. \tag{4.4}$$

The power function $E_\theta \phi(X)$ is continuous by Theorem 2.7.1, so that Lemma
4.1.1 is applicable. The set $\omega$ consists of the two points $\theta_1$ and $\theta_2$, and we therefore
consider first the problem of maximizing $E_{\theta'} \phi(X)$ for some $\theta'$ outside the interval
$[\theta_1, \theta_2]$, subject to (4.4). If this problem is restated in terms of $1 - \phi(x)$, it follows
from part (ii) of Theorem 3.7.1 that its solution is given by (4.3) and (4.4). This
test is therefore UMP among those satisfying (4.4), and hence UMP unbiased
by Lemma 4.1.1. It further follows from part (iii) of the theorem that the power
function of the test has a minimum at a point between $\theta_1$ and $\theta_2$, and is strictly
increasing as $\theta$ tends away from this minimum in either direction.

A closely related problem is that of testing (iv) $H : \theta = \theta_0$ against the alternatives
$\theta \ne \theta_0$. For this there also exists a UMP unbiased test given by (4.3), but
the constants are now determined by

$$E_{\theta_0}[\phi(X)] = \alpha \tag{4.5}$$

and

$$E_{\theta_0}[T(X) \phi(X)] = E_{\theta_0}[T(X)]\, \alpha. \tag{4.6}$$

To see this, let $\theta'$ be any particular alternative, and restrict attention to the
sufficient statistic $T$, the distribution of which by Lemma 2.7.2, is of the form

$$dP_\theta(t) = C(\theta) e^{\theta t}\, d\nu(t).$$

Unbiasedness of a test $\psi(t)$ implies (4.5) with $\phi(x) = \psi[T(x)]$; also that the
power function $\beta(\theta) = E_\theta[\psi(T)]$ must have a minimum at $\theta = \theta_0$. By Theorem
2.7.1, the function $\beta(\theta)$ is differentiable, and the derivative can be computed by
differentiating $E_\theta \psi(T)$ under the expectation sign, so that for all tests $\psi(t)$

$$\beta'(\theta) = E_\theta[T \psi(T)] + \frac{C'(\theta)}{C(\theta)} E_\theta[\psi(T)].$$

For $\psi(t) \equiv \alpha$, this equation becomes

$$0 = E_\theta(T) + \frac{C'(\theta)}{C(\theta)}.$$

Substituting this in the expression for $\beta'(\theta)$ gives

$$\beta'(\theta) = E_\theta[T \psi(T)] - E_\theta(T) E_\theta[\psi(T)],$$

and hence unbiasedness implies (4.6) in addition to (4.5).

Let $M$ be the set of points $(E_{\theta_0}[\psi(T)], E_{\theta_0}[T \psi(T)])$ as $\psi$ ranges over the totality
of critical functions. Then $M$ is convex and contains all points $(u, u E_{\theta_0}(T))$
with $0 < u < 1$. It also contains points $(\alpha, u_2)$ with $u_2 > \alpha E_{\theta_0}(T)$. This follows
from the fact that there exist tests with $E_{\theta_0}[\psi(T)] = \alpha$ and $\beta'(\theta_0) > 0$ (see Problem
3.45). Since similarly $M$ contains points $(\alpha, u_1)$ with $u_1 < \alpha E_{\theta_0}(T)$, the point
$(\alpha, \alpha E_{\theta_0}(T))$ is an inner point of $M$. Therefore, by Theorem 3.6.1(iv), there exist
constants $k_1, k_2$ and a test $\psi(t)$ satisfying (4.5) and (4.6) with $\phi(x) = \psi[T(x)]$,
such that $\psi(t) = 1$ when

$$C(\theta_0)(k_1 + k_2 t) e^{\theta_0 t} < C(\theta') e^{\theta' t}$$

and therefore when

$$a_1 + a_2 t < e^{bt}.$$

This region is either one-sided or the outside of an interval. By Theorem 3.4.1,
a one-sided test has a strictly monotone power function and therefore cannot
satisfy (4.6). Thus $\psi(t)$ is 1 when $t < C_1$ or $> C_2$, and the most powerful test
subject to (4.5) and (4.6) is given by (4.3). This test is unbiased, as is seen by
comparing it with $\phi(x) \equiv \alpha$. It is then also UMP unbiased, since the class of tests
satisfying (4.5) and (4.6) includes the class of unbiased tests.

A simplification of this test is possible if for $\theta = \theta_0$ the distribution of $T$ is
symmetric about some point $a$, that is, if $P_{\theta_0}\{T < a - u\} = P_{\theta_0}\{T > a + u\}$
for all real $u$. Any test which is symmetric about $a$ and satisfies (4.5) must also
satisfy (4.6), since $E_{\theta_0}[T \psi(T)] = E_{\theta_0}[(T - a) \psi(T)] + a E_{\theta_0} \psi(T) = a\alpha = E_{\theta_0}(T)\, \alpha$.
The $C$'s and $\gamma$'s are therefore determined by

$$P_{\theta_0}\{T < C_1\} + \gamma_1 P_{\theta_0}\{T = C_1\} = \frac{\alpha}{2}, \qquad C_2 = 2a - C_1, \quad \gamma_2 = \gamma_1.$$

The above tests of the hypotheses $\theta_1 \le \theta \le \theta_2$ and $\theta = \theta_0$ are strictly unbiased
in the sense that the power is $> \alpha$ for all alternatives $\theta$. For the first of these
tests, given by (4.3) and (4.4), strict unbiasedness is an immediate consequence of
Theorem 3.7.1(iii). This states in fact that the power of the test has a minimum
at a point $\theta_0$ between $\theta_1$ and $\theta_2$ and increases strictly as $\theta$ tends away from $\theta_0$
in either direction. The second of the tests, determined by (4.3), (4.5), and (4.6),
has a continuous power function with a minimum of $\alpha$ at $\theta = \theta_0$. Thus there exist
$\theta_1 < \theta_0 < \theta_2$ such that $\beta(\theta_1) = \beta(\theta_2) = c$ where $\alpha < c < 1$. The test therefore
coincides with the UMP unbiased level-$c$ test of the hypothesis $\theta_1 \le \theta \le \theta_2$, and
the power increases strictly as $\theta$ moves away from $\theta_0$ in either direction. This
proves the desired result.


Example 4.2.1 (Binomial) Let $X$ be the number of successes in $n$ binomial
trials with probability $p$ of success. A theory to be tested assigns to $p$ the value
$p_0$, so that one wishes to test the hypothesis $H : p = p_0$. When rejecting $H$ one
will usually wish to state also whether $p$ appears to be less or greater than $p_0$.
If, however, the conclusion that $p \ne p_0$ in any case requires further investigation,
the preliminary decision is essentially between the two possibilities that the data
do or do not contradict the hypothesis $p = p_0$. The formulation of the problem
as one of hypothesis testing may then be appropriate.

The UMP unbiased test of $H$ is given by (4.3) with $T(X) = X$. The condition
(4.5) becomes

$$\sum_{x=C_1+1}^{C_2-1} \binom{n}{x} p_0^x q_0^{n-x} + \sum_{i=1}^{2} (1 - \gamma_i) \binom{n}{C_i} p_0^{C_i} q_0^{n-C_i} = 1 - \alpha,$$

and the left-hand side of this can be obtained from tables of the individual probabilities
and cumulative distribution function of $X$. The condition (4.6), with the
help of the identity

$$x \binom{n}{x} p_0^x q_0^{n-x} = n p_0 \binom{n-1}{x-1} p_0^{x-1} q_0^{(n-1)-(x-1)},$$

reduces to

$$\sum_{x=C_1+1}^{C_2-1} \binom{n-1}{x-1} p_0^{x-1} q_0^{(n-1)-(x-1)} + \sum_{i=1}^{2} (1 - \gamma_i) \binom{n-1}{C_i-1} p_0^{C_i-1} q_0^{(n-1)-(C_i-1)} = 1 - \alpha,$$

the left-hand side of which can be computed from the binomial tables.

For sample sizes which are not too small, and values of $p_0$ which are not too
close to 0 or 1, the distribution of $X$ is therefore approximately symmetric. In
this case, the much simpler “equal tails” test, for which the $C$'s and $\gamma$'s are
determined by

\[ \sum_{x=0}^{C_1-1} \binom{n}{x} p_0^x q_0^{n-x} + \gamma_1 \binom{n}{C_1} p_0^{C_1} q_0^{n-C_1} = \gamma_2 \binom{n}{C_2} p_0^{C_2} q_0^{n-C_2} + \sum_{x=C_2+1}^{n} \binom{n}{x} p_0^x q_0^{n-x} = \frac{\alpha}{2}, \]

is approximately unbiased, and constitutes a reasonable approximation to the
unbiased test. Note, however, that this approximation requires large sample sizes
when $p_0$ is close to 0 or 1; in this connection, see Example 5.7.2 which discusses
the corresponding problem of confidence intervals for $p$. The literature on this
and other approximations to the binomial distribution is reviewed in Johnson,
Kotz and Kemp (1992). See also the related discussion in Example 5.7.2. ■
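
The equal-tails constants of Example 4.2.1 can be computed directly from the binomial probabilities. The following is a minimal sketch in Python (standard library only); the function names `binom_pmf`, `equal_tails_constants` and `power` are illustrative choices, not notation from the text, and the code assumes $n$ and $\alpha$ are such that $C_1 < C_2$.

```python
from math import comb

def binom_pmf(n, p, x):
    """P{X = x} for X binomial with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def equal_tails_constants(n, p0, alpha):
    """Return (C1, gamma1, C2, gamma2) so that each tail, including the
    randomized boundary point, has probability alpha/2 under p0."""
    half = alpha / 2
    cum = 0.0                       # P{X < c1} accumulated from below
    for c1 in range(n + 1):
        pm = binom_pmf(n, p0, c1)
        if cum + pm > half:
            g1 = (half - cum) / pm
            break
        cum += pm
    cum = 0.0                       # P{X > c2} accumulated from above
    for c2 in range(n, -1, -1):
        pm = binom_pmf(n, p0, c2)
        if cum + pm > half:
            g2 = (half - cum) / pm
            break
        cum += pm
    return c1, g1, c2, g2

def power(n, p, c1, g1, c2, g2):
    """Rejection probability of the two-sided randomized test at p."""
    lower = sum(binom_pmf(n, p, x) for x in range(c1))
    upper = sum(binom_pmf(n, p, x) for x in range(c2 + 1, n + 1))
    return lower + g1 * binom_pmf(n, p, c1) + upper + g2 * binom_pmf(n, p, c2)

if __name__ == "__main__":
    n, p0, alpha = 20, 0.3, 0.1
    consts = equal_tails_constants(n, p0, alpha)
    print("constants:", consts)
    # Power near p0: values below alpha would indicate bias of the equal-tails test.
    for p in (0.25, 0.28, 0.30, 0.32, 0.35):
        print(p, power(n, p, *consts))
```

Evaluating `power` on a grid of values of $p$ near $p_0$ gives a quick numerical impression of how far the equal-tails test is from being unbiased for a given $n$ and $p_0$.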


Example 4.2.2 (Normal variance) Let $X = (X_1, \ldots, X_n)$ be a sample from
a normal distribution with mean 0 and variance $\sigma^2$, so that the density of the
$X$'s is

\[ \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left(-\frac{1}{2\sigma^2} \sum x_i^2\right). \]

Then $T(X) = \sum X_i^2$ is sufficient for $\sigma^2$, and has probability density $(1/\sigma^2) f_n(y/\sigma^2)$,
where

\[ f_n(y) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, y^{(n/2)-1} e^{-y/2}, \qquad y > 0, \]

is the density of a $\chi^2$-distribution with $n$ degrees of freedom. For varying $\sigma$, these
distributions form an exponential family, which arises also in problems of life
testing (see Problem 2.15), and concerning normally distributed variables with
unknown mean and variance (Section 5.3). The acceptance region of the UMP
unbiased test of the hypothesis $H : \sigma = \sigma_0$ is

\[ C_1 \le \sum \frac{x_i^2}{\sigma_0^2} \le C_2 \]

with

\[ \int_{C_1}^{C_2} f_n(y)\, dy = 1 - \alpha \]

and

\[ \int_{C_1}^{C_2} y f_n(y)\, dy = \frac{(1 - \alpha)\, E_{\sigma_0}\left(\sum X_i^2\right)}{\sigma_0^2} = n(1 - \alpha). \]

For the determination of the constants from tables of the $\chi^2$-distribution, it is
convenient to use the identity

\[ y f_n(y) = n f_{n+2}(y), \]

to rewrite the second condition as

\[ \int_{C_1}^{C_2} f_{n+2}(y)\, dy = 1 - \alpha. \]
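
The identity $y f_n(y) = n f_{n+2}(y)$ can be checked directly from the definition of the $\chi^2$ density, using $\Gamma(n/2 + 1) = (n/2)\,\Gamma(n/2)$. The short verification below is added for convenience:

```latex
\[
n f_{n+2}(y)
  = \frac{n\, y^{n/2} e^{-y/2}}{2^{(n/2)+1}\,\Gamma\!\left(\tfrac{n}{2}+1\right)}
  = \frac{n\, y^{n/2} e^{-y/2}}{2^{(n/2)+1}\,\tfrac{n}{2}\,\Gamma\!\left(\tfrac{n}{2}\right)}
  = y \cdot \frac{y^{(n/2)-1} e^{-y/2}}{2^{n/2}\,\Gamma\!\left(\tfrac{n}{2}\right)}
  = y f_n(y).
\]
```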


Alternatively, one can integrate $\int_{C_1}^{C_2} y f_n(y)\, dy$ by parts to reduce the second
condition to

\[ C_1^{n/2} e^{-C_1/2} = C_2^{n/2} e^{-C_2/2}. \]

[For tables giving $C_1$ and $C_2$ see Pachares (1961).] Actually, unless $n$ is very small
or $\sigma_0^2$ very close to 0 or $\infty$, the equal-tails test given by

\[ \int_0^{C_1} f_n(y)\, dy = \int_{C_2}^{\infty} f_n(y)\, dy = \frac{\alpha}{2} \]

is a good approximation to the unbiased test. This follows from the fact that $T$,
suitably normalized, tends to be normally and hence symmetrically distributed
for large $n$. ■
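
In place of tables, the two conditions of Example 4.2.2 can be solved numerically. The sketch below is one possible approach, not a method from the text: it is restricted to even $n$, for which the $\chi^2$ distribution function has the closed form $1 - e^{-y/2}\sum_{j<n/2}(y/2)^j/j!$, and it uses nested bisection. The names `chi2_cdf_even`, `bisect_root` and `unbiased_constants` are hypothetical.

```python
from math import exp, log, factorial

def chi2_cdf_even(y, n):
    """Chi-square CDF for an even number n of degrees of freedom."""
    if y <= 0:
        return 0.0
    h = y / 2
    return 1.0 - exp(-h) * sum(h**j / factorial(j) for j in range(n // 2))

def bisect_root(func, lo, hi, iters=200):
    """Bisection on [lo, hi]; assumes func changes sign on the interval."""
    flo = func(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        fmid = func(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return (lo + hi) / 2

def unbiased_constants(n, alpha):
    """Solve  int_{C1}^{C2} f_n = 1 - alpha  together with
    C1^{n/2} e^{-C1/2} = C2^{n/2} e^{-C2/2}  (n even)."""
    # log of y^{n/2} e^{-y/2}: increasing on (0, n), decreasing on (n, inf)
    g = lambda y: (n / 2) * log(y) - y / 2

    def c2_of(c1):
        # point to the right of the mode y = n with g(c2) = g(c1)
        hi = float(n)
        while g(hi) > g(c1):
            hi *= 2
        return bisect_root(lambda y: g(y) - g(c1), n, hi)

    def coverage_gap(c1):
        return chi2_cdf_even(c2_of(c1), n) - chi2_cdf_even(c1, n) - (1 - alpha)

    # coverage decreases from about 1 to about 0 as c1 moves from 0 to n
    c1 = bisect_root(coverage_gap, 1e-9, n * (1 - 1e-9))
    return c1, c2_of(c1)

if __name__ == "__main__":
    print(unbiased_constants(10, 0.05))
```

Because the second condition is equivalent to $\int_{C_1}^{C_2} f_{n+2}(y)\,dy = 1 - \alpha$, the result can be checked by evaluating the coverage of $[C_1, C_2]$ under both $n$ and $n + 2$ degrees of freedom.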

UMP unbiased tests of the hypotheses (iii) $H : \theta_1 \le \theta \le \theta_2$ and (iv) $H : \theta = \theta_0$
against two-sided alternatives exist not only when the family $p_\theta(x)$ is exponential
but also more generally when it is strictly totally positive (STP$_\infty$). A proof of
(iv) in this case is given in Brown, Johnstone, and MacGibbon (1981); the proof
of (iii) follows from Problem 3.53.


4.3 Similarity and Completeness 

In many important testing problems, the hypothesis concerns a single real-valued 
parameter, but the distribution of the observable random variables depends in 
addition on certain nuisance parameters. For a large class of such problems a 
UMP unbiased test exists and can be found through the method indicated by 
Lemma 4.1.1. This requires the characterization of the tests $\phi$ which satisfy

\[ E_\theta \phi(X) = \alpha \]

for all distributions of $X$ belonging to a given family $\mathcal{P}^X = \{P_\theta, \theta \in \omega\}$. Such
tests are called similar with respect to $\mathcal{P}^X$ or $\omega$, since if $\phi$ is nonrandomized with
critical region $S$, the latter is “similar to the sample space” $\mathcal{X}$ in that both the
probability $P_\theta\{X \in S\}$ and $P_\theta\{X \in \mathcal{X}\}$ are independent of $\theta \in \omega$.

Let $T$ be a sufficient statistic for $\mathcal{P}^X$, and let $\mathcal{P}^T$ denote the family $\{P_\theta^T, \theta \in \omega\}$
of distributions of $T$ as $\theta$ ranges over $\omega$. Then any test satisfying¹

\[ E[\phi(X) \mid t] = \alpha \quad \text{a.e. } \mathcal{P}^T \tag{4.7} \]

is similar with respect to $\mathcal{P}^X$, since then

\[ E_\theta[\phi(X)] = E_\theta\{E[\phi(X) \mid T]\} = \alpha \quad \text{for all } \theta \in \omega. \]

A test satisfying (4.7) is said to have Neyman structure with respect to $T$. It is
characterized by the fact that the conditional probability of rejection is $\alpha$ on each
of the surfaces $T = t$. Since the distribution on each such surface is independent of
$\theta$ for $\theta \in \omega$, the condition (4.7) essentially reduces the problem to that of testing
a simple hypothesis for each value of $t$. It is frequently easy to obtain a most
powerful test among those having Neyman structure, by solving the optimum
problem on each surface separately. The resulting test is then most powerful
among all similar tests provided every similar test has Neyman structure. A
condition for this to be the case can be given in terms of the following definition.
A family $\mathcal{P}$ of probability distributions $P$ is complete if

\[ E_P[f(X)] = 0 \quad \text{for all } P \in \mathcal{P} \tag{4.8} \]

implies

\[ f(x) = 0 \quad \text{a.e. } \mathcal{P}. \tag{4.9} \]


1 A statement is said to hold a.e. $\mathcal{P}$ if it holds except on a set $N$ with $P(N) = 0$ for
all $P \in \mathcal{P}$.

In applications, $\mathcal{P}$ will be the family of distributions of a sufficient statistic.

Example 4.3.1 Consider $n$ independent trials with probability $p$ of success, and
let $X_i$ be 1 or 0 as the $i$th trial is a success or failure. Then $T = X_1 + \cdots + X_n$
is a sufficient statistic for $p$, and the family of its possible distributions is $\mathcal{P}^T =
\{b(p, n), 0 < p < 1\}$. For this family (4.8) implies that

\[ \sum_{t=0}^{n} f(t) \binom{n}{t} \rho^t = 0 \quad \text{for all } 0 < \rho < \infty, \]

where $\rho = p/(1 - p)$. The left-hand side is a polynomial in $\rho$, all the coefficients
of which must be zero. Hence $f(t) = 0$ for $t = 0, \ldots, n$ and the binomial family
of distributions of $T$ is complete. ■


Example 4.3.2 Let $X_1, \ldots, X_n$ be a sample from the uniform distribution
$U(0, \theta)$, $0 < \theta < \infty$. Then $T = \max(X_1, \ldots, X_n)$ is a sufficient statistic for
$\theta$, and (4.8) becomes

\[ \int f(t)\, dP_\theta^T(t) = n\theta^{-n} \int_0^\theta f(t)\, t^{n-1}\, dt = 0 \quad \text{for all } \theta. \]

Let $f(t) = f^+(t) - f^-(t)$ where $f^+$ and $f^-$ denote the positive and negative parts
of $f$ respectively. Then

\[ \nu^+(A) = \int_A f^+(t)\, t^{n-1}\, dt \quad \text{and} \quad \nu^-(A) = \int_A f^-(t)\, t^{n-1}\, dt \]

are two measures over the Borel sets on $(0, \infty)$, which agree for all intervals and
hence for all $A$. This implies $f^+(t) = f^-(t)$ except possibly on a set of Lebesgue
measure zero, and hence $f(t) = 0$ a.e. $\mathcal{P}^T$. ■

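Example 4.3.3 Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be independently normally
distributed as $N(\xi, \sigma^2)$ and $N(\xi, \tau^2)$ respectively. Then the joint density of the
variables is

\[ C(\xi, \sigma, \tau) \exp\left(-\frac{1}{2\sigma^2} \sum x_i^2 + \frac{\xi}{\sigma^2} \sum x_i - \frac{1}{2\tau^2} \sum y_j^2 + \frac{\xi}{\tau^2} \sum y_j\right). \]

The statistic

\[ T = \left(\sum X_i, \sum X_i^2, \sum Y_j, \sum Y_j^2\right) \]

is sufficient; it is, however, not complete, since $E\left(\sum Y_j/n - \sum X_i/m\right)$ is
identically zero. If the $Y$'s are instead distributed with a mean $E(Y) = \eta$ which varies
independently of $\xi$, the set of possible values of the parameters $\theta_1 = -1/2\sigma^2$, $\theta_2 =
\xi/\sigma^2$, $\theta_3 = -1/2\tau^2$, $\theta_4 = \eta/\tau^2$ contains a four-dimensional rectangle, and it
follows from Theorem 4.3.1 below that $\mathcal{P}^T$ is complete. ■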

Completeness of a large class of families of distributions including that of 
Example 4.3.1 is covered by the following theorem. 


Theorem 4.3.1 Let $X$ be a random vector with probability distribution

\[ dP_\theta(x) = C(\theta) \exp\left[\sum_{j=1}^{k} \theta_j T_j(x)\right] d\mu(x), \]

and let $\mathcal{P}^T$ be the family of distributions of $T = (T_1(X), \ldots, T_k(X))$ as $\theta$
ranges over the set $\omega$. Then $\mathcal{P}^T$ is complete provided $\omega$ contains a $k$-dimensional
rectangle.

Proof. By making a translation of the parameter space one can assume without
loss of generality that $\omega$ contains the rectangle

\[ I = \{(\theta_1, \ldots, \theta_k) : -a \le \theta_j \le a,\ j = 1, \ldots, k\}. \]

Let $f(t) = f^+(t) - f^-(t)$ be such that

\[ E_\theta f(T) = 0 \quad \text{for all } \theta \in \omega. \]

Then for all $\theta \in I$, if $\nu$ denotes the measure induced in $T$-space by the measure
$\mu$,

\[ \int e^{\sum \theta_j t_j} f^+(t)\, d\nu(t) = \int e^{\sum \theta_j t_j} f^-(t)\, d\nu(t) \]

and hence in particular

\[ \int f^+(t)\, d\nu(t) = \int f^-(t)\, d\nu(t). \]

Dividing $f$ by a constant, one can take the common value of these two integrals
to be 1, so that

\[ dP^+(t) = f^+(t)\, d\nu(t) \quad \text{and} \quad dP^-(t) = f^-(t)\, d\nu(t) \]

are probability measures, and

\[ \int e^{\sum \theta_j t_j}\, dP^+(t) = \int e^{\sum \theta_j t_j}\, dP^-(t) \]

for all $\theta$ in $I$. Changing the point of view, consider these integrals now as
functions of the complex variables $\theta_j = \xi_j + i\eta_j$, $j = 1, \ldots, k$. For any fixed
$\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_k$ with real parts strictly between $-a$ and $+a$, they are
by Theorem 2.7.1 analytic functions of $\theta_j$ in the strip $R_j : -a < \xi_j < a$, $-\infty <
\eta_j < \infty$ of the complex plane. For $\theta_2, \ldots, \theta_k$ fixed, real, and between $-a$ and $a$,
equality of the integrals holds on the line segment $\{(\xi_1, \eta_1) : -a < \xi_1 < a,\ \eta_1 = 0\}$
and can therefore be extended to the strip $R_1$, in which the integrals are
analytic. By induction the equality can be extended to the complex region
$\{(\theta_1, \ldots, \theta_k) : (\xi_j, \eta_j) \in R_j \text{ for } j = 1, \ldots, k\}$. It follows in particular that for
all real $(\eta_1, \ldots, \eta_k)$

\[ \int e^{i \sum \eta_j t_j}\, dP^+(t) = \int e^{i \sum \eta_j t_j}\, dP^-(t). \]

These integrals are the characteristic functions of the distributions $P^+$ and $P^-$
respectively, and by the uniqueness theorem for characteristic functions,² the two
distributions $P^+$ and $P^-$ coincide. From the definition of these distributions it
then follows that $f^+(t) = f^-(t)$ a.e. $\nu$, and hence that $f(t) = 0$ a.e. $\mathcal{P}^T$, as was
to be proved. ■


2 See for example Section 26 of Billingsley (1995).

Example 4.3.4 (Nonparametric completeness.) Let $X_1, \ldots, X_N$ be
independently and identically distributed with cumulative distribution function $F \in
\mathcal{F}$, where $\mathcal{F}$ is the family of all absolutely continuous distributions. Then the
set of order statistics $T(X) = (X_{(1)}, \ldots, X_{(N)})$ was shown to be sufficient for $\mathcal{F}$
in Section 2.6. We shall now prove it to be complete. Since, by Example 2.4.1,
$T'(X) = \left(\sum X_i, \sum X_i^2, \ldots, \sum X_i^N\right)$ is equivalent to $T(X)$ in the sense that both
induce the same subfield of the sample space, $T'(X)$ is also sufficient and is
complete if and only if $T(X)$ is complete. To prove the completeness of $T'(X)$ and
thereby that of $T(X)$, consider the family of densities

\[ f(x) = C(\theta_1, \ldots, \theta_N) \exp\left(-x^{2N} + \theta_1 x + \cdots + \theta_N x^N\right), \]

where $C$ is a normalizing constant. These densities are defined for all values of the
$\theta$'s since the integral of the exponential is finite, and their distributions belong
to $\mathcal{F}$. The density of a sample of size $N$ is

\[ C^N \exp\left(-\sum x_j^{2N} + \theta_1 \sum x_j + \cdots + \theta_N \sum x_j^N\right) \]

and these densities constitute an exponential family $\mathcal{F}_0$. By Theorem 4.3.1, $T'(X)$
is complete for $\mathcal{F}_0$ and hence also for $\mathcal{F}$, as was to be proved.

The same method of proof establishes also the following more general result.
Let $X_{ij}$, $j = 1, \ldots, N_i$, $i = 1, \ldots, c$, be independently distributed with
absolutely continuous distributions $F_i$, and let $X_i^{(1)} < \cdots < X_i^{(N_i)}$ denote the $N_i$
observations $X_{i1}, \ldots, X_{iN_i}$ arranged in increasing order. Then the set of order
statistics

\[ \left(X_1^{(1)}, \ldots, X_1^{(N_1)}, \ldots, X_c^{(1)}, \ldots, X_c^{(N_c)}\right) \]

is sufficient and complete for the family of distributions obtained by letting
$F_1, \ldots, F_c$ range over all distributions of $\mathcal{F}$. Here completeness is proved by
considering the subfamily $\mathcal{F}_0$ of $\mathcal{F}$ in which the distributions $F_i$ have densities of the
form

\[ f_i(x) = C_i(\theta_{i1}, \ldots, \theta_{iN_i}) \exp\left(-x^{2N_i} + \theta_{i1} x + \cdots + \theta_{iN_i} x^{N_i}\right). \]

The result remains true if $\mathcal{F}$ is replaced by the family $\mathcal{F}_1$ of continuous
distributions. For a proof see Problem 4.13 or Bell, Blackwell, and Breiman (1960). For
related results, see Mandelbaum and Rüschendorf (1987) and Mattner (1996). ■


For the present purpose the slightly weaker property of bounded completeness
is appropriate, a family $\mathcal{P}$ of probability distributions being boundedly complete
if for all bounded functions $f$, (4.8) implies (4.9). If $\mathcal{P}$ is complete it is a fortiori
boundedly complete. An example in which $\mathcal{P}$ is boundedly complete but not
complete is given in Problem 4.12. For additional examples, see Hoeffding (1977),
Bar-Lev and Plachky (1989) and Mattner (1993).


Theorem 4.3.2 Let $X$ be a random variable with distribution $P \in \mathcal{P}$, and let $T$
be a sufficient statistic for $\mathcal{P}$. Then a necessary and sufficient condition for all
similar tests to have Neyman structure with respect to $T$ is that the family $\mathcal{P}^T$ of
distributions of $T$ is boundedly complete.

Proof. Suppose first that $\mathcal{P}^T$ is boundedly complete, and let $\phi(X)$ be similar
with respect to $\mathcal{P}$. Then

\[ E[\phi(X) - \alpha] = 0 \quad \text{for all } P \in \mathcal{P} \]

and hence, if $\psi(t)$ denotes the conditional expectation of $\phi(X) - \alpha$ given $t$,

\[ E\psi(T) = 0 \quad \text{for all } P^T \in \mathcal{P}^T. \]

Since $\psi(t)$ can be taken to be bounded by Lemma 2.4.1, it follows from the
bounded completeness of $\mathcal{P}^T$ that $\psi(t) = 0$ and hence $E[\phi(X) \mid t] = \alpha$ a.e. $\mathcal{P}^T$, as
was to be proved.

Conversely suppose that $\mathcal{P}^T$ is not boundedly complete. Then there exists a
function $f$ such that $|f(t)| \le M$ for some $M$, that $Ef(T) = 0$ for all $P^T \in \mathcal{P}^T$
and $f(T) \ne 0$ with positive probability for some $P^T \in \mathcal{P}^T$. Let $\phi(t) = cf(t) + \alpha$,
where $c = \min(\alpha, 1 - \alpha)/M$. Then $\phi$ is a critical function, since $0 \le \phi(t) \le 1$,
and it is a similar test, since $E\phi(T) = \alpha$ for all $P^T \in \mathcal{P}^T$. But $\phi$ does not have
Neyman structure, since $\phi(T) \ne \alpha$ with positive probability for at least some
distribution in $\mathcal{P}^T$. ■


4.4 UMP Unbiased Tests for Multiparameter 
Exponential Families 


An important class of hypotheses concerns a real-valued parameter in an
exponential family, with the remaining parameters occurring as unspecified nuisance
parameters. In many of these cases, UMP unbiased tests exist and can be
constructed by means of the theory of the preceding section.

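Let $X$ be distributed according to

\[ dP_{\theta,\vartheta}^X(x) = C(\theta, \vartheta) \exp\left[\theta U(x) + \sum_{i=1}^{k} \vartheta_i T_i(x)\right] d\mu(x), \qquad (\theta, \vartheta) \in \Omega, \tag{4.10} \]

and let $\vartheta = (\vartheta_1, \ldots, \vartheta_k)$ and $T = (T_1, \ldots, T_k)$. We shall consider the problems³
of testing the following hypotheses $H_j$ against the alternatives $K_j$, $j = 1, \ldots, 4$: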


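\[
\begin{array}{ll}
H_1 : \theta \le \theta_0 & K_1 : \theta > \theta_0 \\
H_2 : \theta \le \theta_1 \text{ or } \theta \ge \theta_2 & K_2 : \theta_1 < \theta < \theta_2 \\
H_3 : \theta_1 \le \theta \le \theta_2 & K_3 : \theta < \theta_1 \text{ or } \theta > \theta_2 \\
H_4 : \theta = \theta_0 & K_4 : \theta \ne \theta_0.
\end{array}
\]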


We shall assume that the parameter space $\Omega$ is convex, and that it is not
contained in a linear space of dimension $< k + 1$. This is the case in particular
when $\Omega$ is the natural parameter space of the exponential family. We shall also
assume that there are points in $\Omega$ with $\theta$ both $<$ and $>$ $\theta_0$, $\theta_1$, and $\theta_2$ respectively.


3 Such problems are also treated in Johansen (1979), which in addition discusses large 
sample tests of hypotheses specifying more than one parameter. 






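Attention can be restricted to the sufficient statistics $(U, T)$ which have the
joint distribution

\[ dP_{\theta,\vartheta}^{U,T}(u, t) = C(\theta, \vartheta) \exp\left(\theta u + \sum_{i=1}^{k} \vartheta_i t_i\right) d\nu(u, t), \qquad (\theta, \vartheta) \in \Omega. \tag{4.11} \]

When $T = t$ is given, $U$ is the only remaining variable and, by Lemma 2.7.2, the
conditional distribution of $U$ given $t$ constitutes an exponential family

\[ dP_\theta^{U|t}(u) = C_t(\theta)\, e^{\theta u}\, d\nu_t(u). \]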

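In this conditional situation there exists by Corollary 3.4.1 a UMP test for testing
$H_1$, with critical function $\phi_1$, satisfying

\[ \phi_1(u, t) = \begin{cases} 1 & \text{when } u > C_0(t), \\ \gamma_0(t) & \text{when } u = C_0(t), \\ 0 & \text{when } u < C_0(t), \end{cases} \tag{4.12} \]

where the functions $C_0$ and $\gamma_0$ are determined by

\[ E_{\theta_0}[\phi_1(U, T) \mid t] = \alpha \quad \text{for all } t. \tag{4.13} \]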


For testing $H_2$ in the conditional family there exists by Theorem 3.7.1 a UMP
test with critical function

\[ \phi_2(u, t) = \begin{cases} 1 & \text{when } C_1(t) < u < C_2(t), \\ \gamma_i(t) & \text{when } u = C_i(t),\ i = 1, 2, \\ 0 & \text{when } u < C_1(t) \text{ or } > C_2(t), \end{cases} \tag{4.14} \]

where the $C$'s and $\gamma$'s are determined by

\[ E_{\theta_1}[\phi_2(U, T) \mid t] = E_{\theta_2}[\phi_2(U, T) \mid t] = \alpha. \tag{4.15} \]

Consider next the test $\phi_3$ satisfying

\[ \phi_3(u, t) = \begin{cases} 1 & \text{when } u < C_1(t) \text{ or } > C_2(t), \\ \gamma_i(t) & \text{when } u = C_i(t),\ i = 1, 2, \\ 0 & \text{when } C_1(t) < u < C_2(t), \end{cases} \tag{4.16} \]

with the $C$'s and $\gamma$'s determined by

\[ E_{\theta_1}[\phi_3(U, T) \mid t] = E_{\theta_2}[\phi_3(U, T) \mid t] = \alpha. \tag{4.17} \]

When $T = t$ is given, this is (by Section 4.2 of the present chapter) UMP unbiased
for testing $H_3$ and UMP among all tests satisfying (4.17).

Finally, let $\phi_4$ be a critical function satisfying (4.16) with the $C$'s and $\gamma$'s
determined by

\[ E_{\theta_0}[\phi_4(U, T) \mid t] = \alpha \tag{4.18} \]

and

\[ E_{\theta_0}[U \phi_4(U, T) \mid t] = \alpha E_{\theta_0}[U \mid t]. \tag{4.19} \]

Then given $T = t$, it follows again from the results of Section 4.2 that $\phi_4$ is UMP
unbiased for testing $H_4$ and UMP among all tests satisfying (4.18) and (4.19).

So far, the critical functions $\phi_j$ have been considered as conditional tests given
$T = t$. Reinterpreting them now as tests depending on $U$ and $T$ for the
hypotheses concerning the distribution of $X$ (or the joint distribution of $U$ and $T$) as
originally stated, we have the following main theorem.⁴


Theorem 4.4.1 Define the critical functions $\phi_1$ by (4.12) and (4.13); $\phi_2$ by
(4.14) and (4.15); $\phi_3$ by (4.16) and (4.17); $\phi_4$ by (4.16), (4.18), and (4.19).
These constitute UMP unbiased level-$\alpha$ tests for testing the hypotheses $H_1, \ldots, H_4$
respectively when the joint distribution of $U$ and $T$ is given by (4.11).


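Proof. The statistic $T$ is sufficient for $\vartheta$ if $\theta$ has any fixed value, and hence $T$ is
sufficient for each

\[ \omega_j = \{(\theta, \vartheta) : (\theta, \vartheta) \in \Omega,\ \theta = \theta_j\}, \qquad j = 0, 1, 2. \]

By Lemma 2.7.2, the associated family of distributions of $T$ is given by

\[ dP_{\theta_j,\vartheta}^T(t) = C(\theta_j, \vartheta) \exp\left(\sum_{i=1}^{k} \vartheta_i t_i\right) d\nu_{\theta_j}(t), \qquad (\theta_j, \vartheta) \in \omega_j, \quad j = 0, 1, 2. \]

Since by assumption $\Omega$ is convex and of dimension $k + 1$ and contains points on
both sides of $\theta = \theta_j$, it follows that $\omega_j$ is convex and of dimension $k$. Thus $\omega_j$
contains a $k$-dimensional rectangle; by Theorem 4.3.1 the family

\[ \mathcal{P}_j^T = \left\{P_{\theta,\vartheta}^T : (\theta, \vartheta) \in \omega_j\right\} \]

is complete; and similarity of a test $\phi$ on $\omega_j$ implies

\[ E_{\theta_j}[\phi(U, T) \mid t] = \alpha. \]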

(1) Consider first $H_1$. By Theorem 2.7.1, the power function of all tests is
continuous for an exponential family. It is therefore enough to prove $\phi_1$ to be
UMP among all tests that are similar on $\omega_0$ (Lemma 4.1.1), and hence among
those satisfying (4.13). On the other hand, the overall power of a test $\phi$ against
an alternative $(\theta, \vartheta)$ is

\[ E_{\theta,\vartheta}[\phi(U, T)] = \int \left[\int \phi(u, t)\, dP_\theta^{U|t}(u)\right] dP_{\theta,\vartheta}^T(t). \tag{4.20} \]

One therefore maximizes the overall power by maximizing the power of the
conditional test, given by the expression in brackets, separately for each $t$. Since $\phi_1$
has the property of maximizing the conditional power against any $\theta > \theta_0$ subject
to (4.13), this establishes the desired result.

(2) The proof for $H_2$ and $H_3$ is completely analogous. By Lemma 4.1.1, it is
enough to prove $\phi_2$ and $\phi_3$ to be UMP among all tests that are similar on both
$\omega_1$ and $\omega_2$, and hence among all tests satisfying (4.15). For each $t$, $\phi_2$ and $\phi_3$
maximize the conditional power for their respective problems subject to this
condition and therefore also the unconditional power.


4 A somewhat different asymptotic optimality property of these tests is established
by Michel (1979).

(3) Unbiasedness of a test of $H_4$ implies similarity on $\omega_0$ and

\[ \frac{\partial}{\partial\theta}\left[E_{\theta,\vartheta}\, \phi(U, T)\right] = 0 \quad \text{on } \omega_0. \]

The differentiation on the left-hand side of this equation can be carried out under
the expectation sign, and by the computation which earlier led to (4.6), the
equation is seen to be equivalent to

\[ E_{\theta,\vartheta}[U \phi(U, T) - \alpha U] = 0 \quad \text{on } \omega_0. \]

Therefore, since $\mathcal{P}_0^T$ is complete, unbiasedness implies (4.18) and (4.19). As in
the preceding cases, the test, which in addition satisfies (4.16), is UMP among
all tests satisfying these two conditions. That it is UMP unbiased now follows,
as in the proof of Lemma 4.1.1, by comparison with the test $\phi(u, t) \equiv \alpha$.

(4) The functions $\phi_1, \ldots, \phi_4$ were obtained above for each fixed $t$ as a function of
$u$. To complete the proof it is necessary to show that they are jointly measurable
in $u$ and $t$, so that the expectation (4.20) exists. We shall prove this here for the
case of $\phi_1$; the proof for the other cases is sketched in Problems 4.21 and 4.22.
To establish the measurability of $\phi_1$, one needs to show that the functions $C_0(t)$
and $\gamma_0(t)$ defined by (4.12) and (4.13) are $t$-measurable. Omitting the subscript
0, and denoting the conditional distribution function of $U$ given $T = t$ and for
$\theta = \theta_0$ by

\[ F_t(u) = P_{\theta_0}\{U \le u \mid t\}, \]

one can rewrite (4.13) as

\[ F_t(C) - \gamma\,[F_t(C) - F_t(C - 0)] = 1 - \alpha. \]

Here $C = C(t)$ is such that $F_t(C - 0) \le 1 - \alpha \le F_t(C)$, and hence

\[ C(t) = F_t^{-1}(1 - \alpha) \]

where $F_t^{-1}(y) = \inf\{u : F_t(u) \ge y\}$. It follows that $C(t)$ and $\gamma(t)$ will both be
measurable provided $F_t(u)$ and $F_t(u - 0)$ are jointly measurable in $u$ and $t$ and
$F_t^{-1}(1 - \alpha)$ is measurable in $t$.

For each fixed $u$ the function $F_t(u)$ is a measurable function of $t$, and for
each fixed $t$ it is a cumulative distribution function and therefore in particular
nondecreasing and continuous on the right. From the second property it follows
that $F_t(u) \ge c$ if and only if for each $n$ there exists a rational number $r$ such
that $u \le r < u + 1/n$ and $F_t(r) \ge c$. Therefore, if the rationals are denoted by
$r_1, r_2, \ldots$,

\[ \{(u, t) : F_t(u) \ge c\} = \bigcap_n \bigcup_i \left\{(u, t) : 0 \le r_i - u < \frac{1}{n},\ F_t(r_i) \ge c\right\}. \]

This shows that $F_t(u)$ is jointly measurable in $u$ and $t$. The proof for $F_t(u - 0)$
is completely analogous. Since $F_t^{-1}(y) \le u$ if and only if $F_t(u) \ge y$, $F_t^{-1}(y)$ is
$t$-measurable for any fixed $y$ and this completes the proof. ■

The test $\phi_1$ of the above theorem is also UMP unbiased if $\Omega$ is replaced by the
set $\Omega' = \Omega \cap \{(\theta, \vartheta) : \theta \ge \theta_0\}$, and hence for testing $H' : \theta = \theta_0$ against $\theta > \theta_0$.
The assumption that $\Omega$ should contain points with $\theta < \theta_0$ was in fact used only
to prove that the boundary set $\omega_0$ contains a $k$-dimensional rectangle, and this
remains valid if $\Omega$ is replaced by $\Omega'$.

The remainder of this chapter as well as the next chapter will be concerned
mainly with applications of the preceding theorem to various statistical problems.
While this provides the most expeditious proof that the tests in all these cases
are UMP unbiased, there is available also a variation of the approach, which
is more elementary. The proof of Theorem 4.4.1 is quite elementary except for
the following points: (i) the fact that the conditional distributions of $U$ given
$T = t$ constitute an exponential family, (ii) that the family of distributions of $T$
is complete, (iii) that the derivative of $E_{\theta,\vartheta}\,\phi(U, T)$ exists and can be computed
by differentiating under the expectation sign, (iv) that the functions $\phi_1, \ldots, \phi_4$
are measurable. Instead of verifying (i) through (iv) in general, as was done in the
above proof, it is possible in applications of the theorem to check these conditions
directly for each specific problem, which in some cases is quite easy.

Through a transformation of parameters, Theorem 4.4.1 can be extended to
cover hypotheses concerning parameters of the form

\[ \theta^* = a_0 \theta + \sum_{i=1}^{k} a_i \vartheta_i, \qquad a_0 \ne 0. \]

This transformation is formally given by the following lemma, the proof of which
is immediate.

Lemma 4.4.1 The exponential family of distributions (4.10) can also be written
as

\[ dP_{\theta,\vartheta}^X(x) = K(\theta^*, \vartheta) \exp\left[\theta^* U^*(x) + \sum_{i=1}^{k} \vartheta_i T_i^*(x)\right] d\mu(x) \]

where

\[ U^* = \frac{U}{a_0}, \qquad T_i^* = T_i - \frac{a_i}{a_0} U. \]

Application of Theorem 4.4.1 to the form of the distributions given in the
lemma leads to UMP unbiased tests of the hypothesis $H_1^* : \theta^* \le \theta_0$ and the
analogously defined hypotheses $H_2^*$, $H_3^*$, $H_4^*$.
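
The regrouping behind Lemma 4.4.1 can be written out in one line. Substituting the definitions of $\theta^*$, $U^*$ and $T_i^*$ into the new exponent recovers the exponent of (4.10), so that the normalizing constant is unchanged as a function of the original parameters:

```latex
\[
\theta^* U^* + \sum_{i=1}^{k} \vartheta_i T_i^*
  = \Bigl(a_0\theta + \sum_{i=1}^{k} a_i\vartheta_i\Bigr)\frac{U}{a_0}
    + \sum_{i=1}^{k} \vartheta_i\Bigl(T_i - \frac{a_i}{a_0}\,U\Bigr)
  = \theta U + \sum_{i=1}^{k} \vartheta_i T_i .
\]
```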

When testing one of the hypotheses $H_j$ one is frequently interested in the
power $\beta(\theta', \vartheta)$ of $\phi_j$ against some alternative $\theta'$. As is indicated by the notation
and is seen from (4.20), this power will usually depend on the unknown nuisance
parameters $\vartheta$. On the other hand, the power of the conditional test given $T = t$,

\[ \beta(\theta' \mid t) = E_{\theta'}[\phi(U, T) \mid t], \]

is independent of $\vartheta$ and therefore has a known value.

The quantity $\beta(\theta' \mid t)$ can be interpreted in two ways: (i) It is the probability of
rejecting $H$ when $T = t$. Once $T$ has been observed to have the value $t$, it may
be felt, at least in certain problems, that this is a more appropriate expression
of the power in the given situation than $\beta(\theta', \vartheta)$, which is obtained by averaging
$\beta(\theta' \mid t)$ with respect to other values of $t$ not relevant to the situation at hand.
This argument leads to difficulties, since in many cases the conditioning could
be carried even further and it is not clear where the process should stop. (ii) A
more clear-cut interpretation is obtained by considering $\beta(\theta' \mid t)$ as an estimate of
$\beta(\theta', \vartheta)$. Since

\[ E_{\theta',\vartheta}[\beta(\theta' \mid T)] = \beta(\theta', \vartheta), \]

this estimate is unbiased in the sense of equation (1.11). It follows further from
the theory of unbiased estimation and the completeness of the exponential family
that among all unbiased estimates of $\beta(\theta', \vartheta)$ the present one has the smallest
variance. (See TPE2, Chapter 2.)

Regardless of the interpretation, $\beta(\theta' \mid t)$ has the disadvantage compared with
an unconditional power that it becomes available only after the observations have
been taken. It therefore cannot be used to plan the experiment and in particular
to determine the sample size, if this must be done prior to the experiment. On
the other hand, a simple sequential procedure guaranteeing a specified power $\beta$
against the alternatives $\theta = \theta'$ is obtained by continuing taking observations until
the conditional power $\beta(\theta' \mid t)$ is $\ge \beta$.


4.5 Comparing Two Poisson or Binomial 
Populations 

A problem arising in many different contexts is the comparison of two treatments 
or of one treatment with a control situation in which no treatment is applied. 
If the observations consist of the number of successes in a sequence of trials for 
each treatment, for example the number of cures of a certain disease, the problem 
becomes that of testing the equality of two binomial probabilities. If the basic 
distributions are Poisson, for example in a comparison of the radioactivity of two 
substances, one will be testing the equality of two Poisson distributions. 

When testing whether a treatment has a beneficial effect by comparing it with
the control situation of no treatment, the problem is of the one-sided type. If $\xi_2$
and $\xi_1$ denote the parameter values when the treatment is or is not applied, the
class of alternatives is $K : \xi_2 > \xi_1$. The hypothesis is $\xi_2 = \xi_1$ if it is known a priori
that there is either no effect or a beneficial one; it is $\xi_2 \le \xi_1$ if the possibility
is admitted that the treatment may actually be harmful. Since the test is the
same for the two hypotheses, the second somewhat safer hypothesis would seem
preferable in most cases.

A one-sided formulation is sometimes appropriate also when a new treatment
or process is being compared with a standard one, where the new treatment is
of interest only if it presents an improvement. On the other hand, if the two
treatments are on an equal footing, the hypothesis $\xi_2 = \xi_1$ of equality of two
treatments is tested against the two-sided alternatives $\xi_2 \ne \xi_1$. The formulation
of this problem as one of hypothesis testing is usually quite artificial, since in
case of rejection of the hypothesis one will obviously wish to know which of the
treatments is better.⁵ Such two-sided tests do, however, have important
applications to the problem of obtaining confidence limits for the extent by which
one treatment is better than the other. They also arise when the parameter $\xi$
does not measure a treatment effect but refers to an auxiliary variable which
one hopes can be ignored. For example, $\xi_1$ and $\xi_2$ may refer to the effect of two
different hospitals in a medical investigation in which one would like to combine
the patients into a single study group. (In this connection, see also Section 7.3.)


5 The comparison of two treatments as a three-decision problem or as the simultaneous
testing of two one-sided hypotheses is discussed and the literature reviewed in Shaffer
(2002).

To apply Theorem 4.4.1 to this comparison problem it is necessary to express
the distributions in an exponential form with $\theta = f(\xi_1, \xi_2)$, for example $\theta = \xi_2 - \xi_1$
or $\xi_2/\xi_1$, such that the hypotheses of interest become equivalent to those of
Theorem 4.4.1. In the present section the problem will be considered for Poisson
and binomial distributions; the case of normal distributions will be taken up in
Chapter 5.

We consider first the Poisson problem in which $X$ and $Y$ are independently
distributed according to $P(\lambda)$ and $P(\mu)$, so that their joint distribution can be
written as

\[ P\{X = x, Y = y\} = \frac{e^{-(\lambda+\mu)}}{x!\,y!} \exp\left[y \log\frac{\mu}{\lambda} + (x + y)\log\lambda\right]. \]

By Theorem 4.4.1 there exist UMP unbiased tests of the four hypotheses
$H_1, \ldots, H_4$ concerning the parameter $\theta = \log(\mu/\lambda)$ or equivalently concerning
the ratio $\rho = \mu/\lambda$. This includes in particular the hypotheses $\mu \le \lambda$ (or $\mu = \lambda$)
against the alternatives $\mu > \lambda$, and $\mu = \lambda$ against $\mu \ne \lambda$. Comparing the
distribution of $(X, Y)$ with (4.10), one has $U = Y$ and $T = X + Y$, and by Theorem
4.4.1 the tests are performed conditionally on the integer points of the line
segment $X + Y = t$ in the positive quadrant of the $(x, y)$ plane. The conditional
distribution of $Y$ given $X + Y = t$ is (Problem 2.14)

\[ P\{Y = y \mid X + Y = t\} = \binom{t}{y} \left(\frac{\mu}{\lambda + \mu}\right)^{y} \left(\frac{\lambda}{\lambda + \mu}\right)^{t-y}, \qquad y = 0, 1, \ldots, t, \]

the binomial distribution corresponding to $t$ trials and probability $p = \mu/(\lambda + \mu)$
of success. The original hypotheses therefore reduce to the corresponding ones
about the parameter $p$ of a binomial distribution. The hypothesis $H : \mu \le a\lambda$, for
example, becomes $H : p \le a/(a + 1)$, which is rejected when $Y$ is too large. The
cutoff point depends of course, in addition to $a$, also on $t$. It can be determined
from tables of the binomial, and for large $t$ approximately from tables of the
normal distribution.
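
The reduction of $H : \mu \le a\lambda$ to a binomial problem is easy to carry out numerically. The sketch below (Python, standard library only) computes the randomized conditional test given $t = x + y$ together with the conditional tail probability $P\{Y \ge y \mid t\}$ at $p_0 = a/(a+1)$; the function names `upper_cutoff` and `poisson_ratio_test` are illustrative and not taken from the text.

```python
from math import comb

def binom_pmf(t, p, y):
    """P{Y = y} for Y binomial with t trials and success probability p."""
    return comb(t, y) * p**y * (1 - p)**(t - y)

def upper_cutoff(t, p0, alpha):
    """Return (C, gamma) with P{Y > C} + gamma * P{Y = C} = alpha under p0."""
    tail = 0.0
    for c in range(t, -1, -1):
        pm = binom_pmf(t, p0, c)
        if tail + pm > alpha:
            return c, (alpha - tail) / pm
        tail += pm
    raise ValueError("alpha must be smaller than 1")

def poisson_ratio_test(x, y, a, alpha=0.05):
    """Conditional test of H: mu <= a * lambda given Poisson counts x and y.
    Returns the rejection probability phi(y, t) and the conditional
    tail probability P{Y >= y | X + Y = t} at the boundary p0 = a/(a+1)."""
    t = x + y
    p0 = a / (a + 1)
    c, gamma = upper_cutoff(t, p0, alpha)
    phi = 1.0 if y > c else (gamma if y == c else 0.0)
    tail_prob = sum(binom_pmf(t, p0, k) for k in range(y, t + 1))
    return phi, tail_prob

if __name__ == "__main__":
    # Counts 3 and 12, testing mu <= lambda (a = 1) at level 0.05.
    print(poisson_ratio_test(3, 12, a=1.0))
```

Note that the cutoff is recomputed for each observed $t$, which is exactly the dependence on $t$ mentioned above.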

In many applications the ratio $\rho = \mu/\lambda$ is a reasonable measure of the extent to
which the two Poisson populations differ, since the parameters $\lambda$ and $\mu$ measure
the rates (in time or space) at which two Poisson processes produce the events
in question. One might therefore hope that the power of the above tests depends
only on this ratio, but this is not the case. On the contrary, for each fixed value
of $\rho$ corresponding to an alternative to the hypothesis being tested, the power
$\beta(\lambda, \mu) = \beta(\lambda, \rho\lambda)$ is an increasing function of $\lambda$, which tends to 1 as $\lambda \to \infty$ and
to $\alpha$ as $\lambda \to 0$. To see this consider the power $\beta(\rho \mid t)$ of the conditional test given
$t$. This is an increasing function of $t$, since it is the power of the optimum test
based on $t$ binomial trials. The conditioning variable $T$ has a Poisson distribution
with parameter $\lambda(1 + \rho)$, and its distribution for varying $\lambda$ forms an exponential
family. It follows from Lemma 3.4.2 that the overall power $E[\beta(\rho \mid T)]$ is an increasing
function of $\lambda$. As $\lambda \to 0$ or $\infty$, $T$ tends in probability to 0 or $\infty$, and the power
against a fixed alternative $\rho$ tends to $\alpha$ or 1.
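
The dependence of the power on $\lambda$ can be illustrated numerically by averaging the conditional power over the Poisson distribution of $T$. The following sketch (Python, standard library only) does this for the one-sided test of $\rho \le \rho_0$ against a fixed alternative $\rho_1$; the function names and the truncation tolerance `tol` are assumptions made for the illustration, and the Poisson sum is simply cut off once nearly all of its mass has been accumulated.

```python
from math import comb, exp

def binom_pmf(t, p, y):
    """P{Y = y} for Y binomial with t trials and success probability p."""
    return comb(t, y) * p**y * (1 - p)**(t - y)

def conditional_power(t, p0, p1, alpha):
    """Power at p1 of the randomized level-alpha test of p <= p0
    based on t binomial trials (rejecting for large Y)."""
    tail0 = tail1 = 0.0              # P{Y > c} under p0 and under p1
    for c in range(t, -1, -1):
        pm0, pm1 = binom_pmf(t, p0, c), binom_pmf(t, p1, c)
        if tail0 + pm0 > alpha:      # c is the cutoff C(t)
            gamma = (alpha - tail0) / pm0
            return tail1 + gamma * pm1
        tail0 += pm0
        tail1 += pm1
    raise ValueError("alpha must be smaller than 1")

def overall_power(lam, rho0, rho1, alpha=0.05, tol=1e-12, t_max=2000):
    """E[beta(rho1 | T)] with T Poisson with mean lam * (1 + rho1)."""
    p0, p1 = rho0 / (1 + rho0), rho1 / (1 + rho1)
    mean = lam * (1 + rho1)
    total, mass, t, weight = 0.0, 0.0, 0, exp(-mean)
    while mass < 1 - tol and t <= t_max:
        total += weight * conditional_power(t, p0, p1, alpha)
        mass += weight
        t += 1
        weight *= mean / t           # Poisson recursion P{T=t} = P{T=t-1} * mean / t
    return total

if __name__ == "__main__":
    for lam in (0.1, 1.0, 5.0, 20.0):
        print(lam, overall_power(lam, rho0=1.0, rho1=2.0))
```

For small $\lambda$ the printed power stays close to $\alpha$, since $T$ is then 0 with high probability and the conditional test can only reject by randomization.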

The above test is also applicable to samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from
two Poisson distributions. The statistics $X = \sum_{i=1}^{m} X_i$ and $Y = \sum_{j=1}^{n} Y_j$ are
then sufficient for $\lambda$ and $\mu$, and have Poisson distributions with parameters $m\lambda$
and $n\mu$ respectively. In planning an experiment one might wish to determine
$m = n$ so large that the test of, say, $H : \rho \le \rho_0$ has power against a specified
alternative $\rho_1$ greater than or equal to some preassigned $\beta$. However, it follows
from the discussion of the power function for $n = 1$, which applies equally to
any other $n$, that this cannot be achieved for any fixed $n$, no matter how large.
This is seen more directly by noting that as $\lambda \to 0$, for both $\rho = \rho_0$ and $\rho = \rho_1$,
the probability of the event $X = Y = 0$ tends to 1. Therefore, the power of
any level-$\alpha$ test against $\rho = \rho_1$ and for varying $\lambda$ cannot be bounded away from
$\alpha$. This difficulty can be overcome only by permitting observations to be taken
sequentially. One can for example determine $t_0$ so large that the test of the
hypothesis $p \le \rho_0/(1 + \rho_0)$ on the basis of $t_0$ binomial trials has power $\ge \beta$
against the alternative $p_1 = \rho_1/(1 + \rho_1)$. By observing $(X_1, Y_1), (X_2, Y_2), \ldots$ and
continuing until $\sum (X_i + Y_i) \ge t_0$, one obtains a test with power $\ge \beta$ against all
alternatives with $\rho \ge \rho_1$. 6
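
The choice of $t_0$ in the sequential scheme can be sketched numerically. The following is only an illustration under stated assumptions: it uses the nonrandomized (conservative) one-sided binomial test in place of the randomized UMP test, and the function names `binom_upper_tail`, `critical_value`, and `smallest_t0` are hypothetical helpers, not part of the text.

```python
from math import comb

def binom_upper_tail(t, p, c):
    """P{Y >= c} for Y distributed as binomial with t trials and success probability p."""
    return sum(comb(t, y) * p**y * (1 - p)**(t - y) for y in range(c, t + 1))

def critical_value(t, p0, alpha):
    """Smallest c with P_{p0}{Y >= c} <= alpha; c = t + 1 means the test never rejects."""
    for c in range(t + 2):
        if binom_upper_tail(t, p0, c) <= alpha:
            return c

def smallest_t0(rho0, rho1, alpha, beta, t_max=2000):
    """Smallest number t of binomial trials for which the nonrandomized test of
    p <= rho0/(1+rho0) has power >= beta against p1 = rho1/(1+rho1).
    Returns (t, c), or None if no t <= t_max works (t_max is an assumed search cap)."""
    p0 = rho0 / (1 + rho0)
    p1 = rho1 / (1 + rho1)
    for t in range(1, t_max + 1):
        c = critical_value(t, p0, alpha)
        if binom_upper_tail(t, p1, c) >= beta:
            return t, c
    return None
```

Because the nonrandomized test is discrete, its power need not be monotone in $t$, so the value returned is merely the first $t$ meeting the requirement; the randomized test of the text avoids this irregularity.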

The corresponding comparison of two binomial probabilities is quite similar. 
Let X and Y be independent binomial variables with joint distribution 


$$
P\{X = x, Y = y\} = \binom{m}{x} p_1^x q_1^{m-x} \binom{n}{y} p_2^y q_2^{n-y}
$$

$$
= \binom{m}{x}\binom{n}{y} q_1^m q_2^n \exp\left[ y\left(\log\frac{p_2}{q_2} - \log\frac{p_1}{q_1}\right) + (x + y)\log\frac{p_1}{q_1} \right].
$$


The four hypotheses $H_1, \ldots, H_4$ can then be tested concerning the parameter

$$
\theta = \log\left(\frac{p_2}{q_2} \Big/ \frac{p_1}{q_1}\right),
$$

or equivalently concerning the odds ratio (also called cross-product ratio)

$$
\rho = \frac{p_2}{q_2} \Big/ \frac{p_1}{q_1}.
$$

This includes in particular the problems of testing $H'_1 : p_2 \le p_1$ against $p_2 > p_1$
and $H'_4 : p_2 = p_1$ against $p_2 \ne p_1$. As in the Poisson case, $U = Y$ and $T = X + Y$,
and the test is carried out in terms of the conditional distribution of $Y$ on the
line segment $X + Y = t$. This distribution is given by


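$$
P\{Y = y \mid X + Y = t\} = C_t(\rho) \binom{m}{t-y}\binom{n}{y} \rho^{y}, \qquad y = 0, 1, \ldots, t, \tag{4.21}
$$

where

$$
C_t(\rho) = \frac{1}{\sum_{y'=0}^{t} \binom{m}{t-y'}\binom{n}{y'} \rho^{y'}}.
$$
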


6 A discussion of this and alternative procedures for achieving the same aim is given
by Birnbaum (1954a).

In the particular case of the hypotheses $H'_1$ and $H'_4$, the boundary value $\theta_0$ of
(4.13), (4.18), and (4.19) is 0, and the corresponding value of $\rho$ is $\rho_0 = 1$. The
conditional distribution then reduces to

$$
P\{Y = y \mid X + Y = t\} = \frac{\binom{m}{t-y}\binom{n}{y}}{\binom{m+n}{t}},
$$

which is the hypergeometric distribution.
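
As a small numerical sketch of the conditional distribution (4.21) and of its reduction to the hypergeometric distribution at $\rho = 1$, one may proceed as follows. The function names are hypothetical, and only the Python standard library is assumed.

```python
from math import comb

def conditional_pmf(m, n, t, rho):
    """P{Y = y | X + Y = t} for y = 0..t, following (4.21); returned as a list indexed by y."""
    # comb(m, k) is 0 when k > m, so impossible values of y get weight 0 automatically
    weights = [comb(m, t - y) * comb(n, y) * rho**y for y in range(t + 1)]
    total = sum(weights)  # this is 1 / C_t(rho)
    return [w / total for w in weights]

def hypergeometric_pmf(m, n, t):
    """The same distribution at rho = 1, written in closed form."""
    return [comb(m, t - y) * comb(n, y) / comb(m + n, t) for y in range(t + 1)]
```

For $\rho = 1$ the normalizing sum equals $\binom{m+n}{t}$, so the two functions agree; for $\rho > 1$ the first shifts mass toward larger values of $y$.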

Tables of critical values by Finney (1948) are reprinted in Biometrika Tables 
for Statisticians, Vol. 1, Table 38 and are extended in Finney, Latscha, Bennett, 
Hsu, and Horst (1963, 1966). Somewhat different ranges are covered in Armsen 
(1955), and related charts are provided by Bross and Kasten (1957). Extensive 
tables of the hypergeometric distributions have been computed by Lieberman and 
Owen (1961). Various approximations are discussed in Johnson, Kotz and Kemp 
(1992, Section 6.5). Critical values can also be easily computed with built-in
functions of statistical packages such as R. 7

The UMP unbiased test of $p_1 = p_2$, which is based on the (conditional)
hypergeometric distribution, requires randomization to obtain an exact conditional
level $\alpha$ for each $t$ of the sufficient statistic $T$. Since in practice randomization is
usually unacceptable, the one-sided test is frequently performed by rejecting when
$Y \ge C(T)$, where $C(t)$ is the smallest integer for which $P\{Y \ge C(t) \mid T = t\} \le \alpha$.
This conservative test is called Fisher's exact test [after the treatment given in
Fisher (1934a)], since the probabilities are calculated from the exact
hypergeometric rather than an approximate normal distribution. The resulting conditional
levels (and hence the unconditional level) are often considerably smaller than $\alpha$,
and this results in a substantial loss of power. An approximate test whose overall
level tends to be closer to $\alpha$ is obtained by using the normal approximation to
the hypergeometric distribution without continuity correction. [For a comparison
of this test with some competitors, see e.g. Garside and Mack (1976).] A
nonrandomized test that provides a conservative overall level, but that is less
conservative than the "exact" test, is described by Boschloo (1970) and by
McDonald, Davis, and Milliken (1977). For surveys of the extensive literature on
these and related aspects of 2 × 2 and more generally r × c tables, see Agresti
(1992, 2002), Sahai and Khurshid (1995) and Martin and Tapia (1998).
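
The conservativeness of the nonrandomized test can be seen in a small computation. The sketch below determines $C(t)$ and the attained conditional level directly from the hypergeometric distribution; the function names are hypothetical and the example values ($m = n = 10$, $t = 8$, $\alpha = .05$) are chosen only for illustration.

```python
from math import comb

def hypergeom_upper_tail(m, n, t, c):
    """P{Y >= c | X + Y = t} under p1 = p2 (hypergeometric null distribution)."""
    return sum(comb(m, t - y) * comb(n, y) for y in range(max(c, 0), t + 1)) / comb(m + n, t)

def fisher_critical_value(m, n, t, alpha):
    """Smallest integer C(t) with P{Y >= C(t) | T = t} <= alpha."""
    for c in range(t + 2):
        if hypergeom_upper_tail(m, n, t, c) <= alpha:
            return c

# Illustration: the attained conditional level can fall well below the nominal alpha.
c = fisher_critical_value(10, 10, 8, 0.05)       # 7
level = hypergeom_upper_tail(10, 10, 8, c)       # 1245/125970, about 0.0099
```

In this example rejecting for $Y \ge 6$ would give conditional level $10695/125970 \approx .085 > .05$, so the test must fall back to $Y \ge 7$ with a level below $.01$.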


4.6 Testing for Independence in a 2 x 2 Table 

Two characteristics A and B , which each member of a population may or may 
not possess, are to be tested for independence. The probabilities or proportion of 
individuals possessing properties A and B are denoted P(A) and P(B). 

If $P(A)$ and $P(B)$ are unknown, a sample from one of the categories such as
$A$ does not provide a basis for distinguishing between the hypothesis and the
alternatives. This follows from the fact that the number in the sample possessing
characteristic $B$ then constitutes a binomial variable with probability $P(B \mid A)$,
which is completely unknown both when the hypothesis is true and when it is
false. The hypothesis can, however, be tested if samples are taken both from
categories $A$ and $A^c$, the complement of $A$, or both from $B$ and $B^c$. In the
latter case, for example, if the sample sizes are $m$ and $n$, the numbers of cases
possessing characteristic $A$ in the two samples constitute independent variables
with binomial distributions $b(p_1, m)$ and $b(p_2, n)$ respectively, where $p_1 = P(A \mid B)$
and $p_2 = P(A \mid B^c)$. The hypothesis of independence of the two characteristics,
$P(A \mid B) = P(A)$, is then equivalent to the hypothesis $p_1 = p_2$ and the problem
reduces to that treated in the preceding section.

7 This package can be downloaded for free from http://cran.r-project.org/.

Instead of selecting samples from two of the categories, it is frequently more 
convenient to take the sample at random from the population as a whole. The 
results of such a sample can be summarized in the following 2 × 2 contingency
table, the entries of which give the numbers in the various categories:

               A       A^c
    B          X       X'       M
    B^c        Y       Y'       N
               T       T'       s


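The joint distribution of the variables $X$, $X'$, $Y$, and $Y'$ is multinomial, and is
given by

$$
P\{X = x, X' = x', Y = y, Y' = y'\} = \frac{s!}{x!\,x'!\,y!\,y'!}\, p_{AB}^{x}\, p_{A^cB}^{x'}\, p_{AB^c}^{y}\, p_{A^cB^c}^{y'}
$$

$$
= \frac{s!}{x!\,x'!\,y!\,y'!}\, p_{A^cB^c}^{s} \exp\left( x \log\frac{p_{AB}}{p_{A^cB^c}} + x' \log\frac{p_{A^cB}}{p_{A^cB^c}} + y \log\frac{p_{AB^c}}{p_{A^cB^c}} \right).
$$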


Lemma 4.4.1 and Theorem 4.4.1 are therefore applicable to any parameter of the
form

$$
\theta^* = a_0 \log\frac{p_{AB}}{p_{A^cB^c}} + a_1 \log\frac{p_{A^cB}}{p_{A^cB^c}} + a_2 \log\frac{p_{AB^c}}{p_{A^cB^c}}.
$$

Putting $a_1 = a_2 = 1$, $a_0 = -1$, $\Delta = e^{\theta^*} = (p_{A^cB}\, p_{AB^c}) / (p_{AB}\, p_{A^cB^c})$, and
denoting the probabilities of $A$ and $B$ in the population by $p_A = p_{AB} + p_{AB^c}$,
$p_B = p_{AB} + p_{A^cB}$, one finds

$$
\begin{aligned}
p_{AB} &= p_A p_B + \frac{1-\Delta}{\Delta}\, p_{A^cB}\, p_{AB^c}, \\
p_{A^cB} &= p_{A^c} p_B - \frac{1-\Delta}{\Delta}\, p_{A^cB}\, p_{AB^c}, \\
p_{AB^c} &= p_A p_{B^c} - \frac{1-\Delta}{\Delta}\, p_{A^cB}\, p_{AB^c}, \\
p_{A^cB^c} &= p_{A^c} p_{B^c} + \frac{1-\Delta}{\Delta}\, p_{A^cB}\, p_{AB^c}.
\end{aligned}
$$

Independence of $A$ and $B$ is therefore equivalent to $\Delta = 1$, and $\Delta < 1$ and $\Delta > 1$
correspond to positive and negative dependence respectively. 8
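
The relation between $\Delta$ and the departure of the cell probabilities from the products of the marginals can be checked numerically. The following sketch uses hypothetical function names and an arbitrary set of cell probabilities; it returns the difference between each cell probability and the corresponding right-hand side, which should vanish up to rounding.

```python
def cross_product_ratio(p_ab, p_acb, p_abc, p_acbc):
    """Delta = (p_{A^c B} p_{A B^c}) / (p_{AB} p_{A^c B^c})."""
    return (p_acb * p_abc) / (p_ab * p_acbc)

def identity_residuals(p_ab, p_acb, p_abc, p_acbc):
    """Residuals of the four identities linking cell probabilities, marginals and Delta."""
    d = cross_product_ratio(p_ab, p_acb, p_abc, p_acbc)
    p_a, p_b = p_ab + p_abc, p_ab + p_acb          # marginal probabilities of A and B
    p_ac, p_bc = 1 - p_a, 1 - p_b                  # and of their complements
    corr = (1 - d) / d * p_acb * p_abc             # common correction term
    return (
        p_ab - (p_a * p_b + corr),
        p_acb - (p_ac * p_b - corr),
        p_abc - (p_a * p_bc - corr),
        p_acbc - (p_ac * p_bc + corr),
    )
```

With cell probabilities $(.4, .1, .2, .3)$, for instance, $\Delta = 1/6 < 1$ and $p_{AB} = .4$ exceeds $p_A p_B = .3$, in agreement with positive dependence.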

The test of the hypothesis of independence, or any of the four hypotheses 
concerning A, is carried out in terms of the conditional distribution of X given 
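$X + X' = m$, $X + Y = t$. Instead of computing this distribution directly, consider
first the conditional distribution subject only to the condition $X + X' = m$, and
hence $Y + Y' = s - m = n$. This is seen to be

$$
P\{X = x, Y = y \mid X + X' = m\} = \binom{m}{x}\binom{n}{y} \left(\frac{p_{AB}}{p_B}\right)^{x} \left(\frac{p_{A^cB}}{p_B}\right)^{m-x} \left(\frac{p_{AB^c}}{p_{B^c}}\right)^{y} \left(\frac{p_{A^cB^c}}{p_{B^c}}\right)^{n-y},
$$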

which is the distribution of two independent binomial variables, the number of
successes in $m$ and $n$ trials with probability $p_1 = p_{AB}/p_B$ and $p_2 = p_{AB^c}/p_{B^c}$.
Actually, this is clear without computation, since we are now dealing with samples
of fixed size $m$ and $n$ from the subpopulations $B$ and $B^c$ and the probability of
$A$ in these subpopulations is $p_1$ and $p_2$. If now the additional restriction $X + Y = t$
is imposed, the conditional distribution of $X$ subject to the two conditions
$X + X' = m$ and $X + Y = t$ is the same as that of $X$ given $X + Y = t$ in the case
of two independent binomials considered in the previous section. It is therefore
given by

$$
P\{X = x \mid X + X' = m, X + Y = t\} = C_t(\rho) \binom{m}{x}\binom{n}{t-x} \rho^{t-x}, \qquad x = 0, \ldots, t,
$$

that is, by (4.21) expressed in terms of $x$ instead of $y$. (Here the choice of $X$ as
testing variable is quite arbitrary; we could equally well again have chosen $Y$.)
For the parameter $\rho$ one finds

$$
\rho = \frac{p_2}{q_2} \Big/ \frac{p_1}{q_1} = \frac{p_{A^cB}\, p_{AB^c}}{p_{AB}\, p_{A^cB^c}} = \Delta.
$$


From these considerations it follows that the conditional test given $X + X' = m$,
$X + Y = t$, for testing any of the hypotheses concerning $\Delta$ is identical with the
conditional test given $X + Y = t$ of the same hypothesis concerning $\rho = \Delta$ in
the preceding section, in which $X + X' = m$ was given a priori. In particular,
the conditional test for testing the hypothesis of independence $\Delta = 1$, Fisher's
exact test, is the same as that of testing the equality of two binomial $p$'s and is
therefore given in terms of the hypergeometric distribution.
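
The equivalence of the two conditional distributions can be verified by brute force for small tables. The sketch below (hypothetical function names; cell probabilities ordered as $p_{AB}, p_{A^cB}, p_{AB^c}, p_{A^cB^c}$) conditions the multinomial distribution on both margins by direct enumeration and compares the result with (4.21) written in terms of $x$ with $\rho = \Delta$.

```python
from math import comb, factorial

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given cell counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)          # each successive division is exact
    value = float(coef)
    for c, p in zip(counts, probs):
        value *= p**c
    return value

def conditional_by_enumeration(m, n, t, probs):
    """P{X = x | X + X' = m, X + Y = t} obtained by renormalizing the multinomial
    probabilities of all tables (x, m - x, t - x, n - t + x) with the given margins."""
    xs = range(max(0, t - n), min(m, t) + 1)
    w = {x: multinomial_pmf((x, m - x, t - x, n - t + x), probs) for x in xs}
    total = sum(w.values())
    return {x: v / total for x, v in w.items()}

def conditional_by_formula(m, n, t, rho):
    """The same distribution from C_t(rho) * comb(m, x) * comb(n, t - x) * rho**(t - x)."""
    xs = range(max(0, t - n), min(m, t) + 1)
    w = {x: comb(m, x) * comb(n, t - x) * rho**(t - x) for x in xs}
    total = sum(w.values())
    return {x: v / total for x, v in w.items()}
```

The agreement holds for every choice of cell probabilities, since the multinomial probability of a table with fixed margins depends on $x$ only through $\binom{m}{x}\binom{n}{t-x}\Delta^{-x}$.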

At the beginning of the section it was pointed out that the hypothesis of
independence can be tested on the basis of samples obtained in a number of
different ways. Either samples of fixed size can be taken from $A$ and $A^c$ or from
$B$ and $B^c$, or the sample can be selected at random from the population at large.
Which of these designs is most efficient depends on the cost of sampling from
the various categories and from the population at large, and also on the cost
of performing the necessary classification of a selected individual with respect
to the characteristics in question. Suppose, however, for a moment that these
considerations are neglected and that the designs are compared solely in terms
of the power that the resulting tests achieve against a common alternative. Then
the following results 9 can be shown to hold asymptotically as the total sample
size $s$ tends to infinity:

(i) If samples of size $m$ and $n$ ($m + n = s$) are taken from $B$ and $B^c$ or from $A$
and $A^c$, the best choice of $m$ and $n$ is $m = n = s/2$.

(ii) It is better to select samples of equal size $s/2$ from $B$ and $B^c$ than from $A$
and $A^c$ provided $|p_B - \tfrac{1}{2}| > |p_A - \tfrac{1}{2}|$.

(iii) Selecting the sample at random from the population at large is worse than
taking equal samples either from $A$ and $A^c$ or from $B$ and $B^c$.

These statements, which we shall not prove here, can be established by using
the normal approximation for the distribution of the binomial variables $X$ and
$Y$ when $m$ and $n$ are fixed, and by noting that under random sampling from the
population at large, $M/s$ and $N/s$ tend in probability to $p_B$ and $p_{B^c}$ respectively.

8 $\Delta$ is equivalent to Yule's measure of association, which is $Q = (1 - \Delta)/(1 + \Delta)$.
For a discussion of this and related measures see Goodman and Kruskal (1954, 1959),
Edwards (1963), Haberman (1982) and Agresti (2002).


4.7 Alternative Models for 2x2 Tables 


Conditioning of the multinomial model for the 2x2 table on the row (or column) 
totals was seen in the last section to lead to the two-binomial model of Section 
4.5. Similarly, the multinomial model itself can be obtained as a conditional 
model in some situations in which not only the marginal totals M, N, T, and 
T' are random but the total sample size s is also a random variable. Suppose 
that the occurrence of events (e.g. patients presenting themselves for treatment) 
is observed over a given period of time, and that the events belonging to each 
of the categories $AB$, $A^cB$, $AB^c$, $A^cB^c$ are governed by independent Poisson
processes, so that by (1.2) the numbers $X$, $X'$, $Y$, $Y'$ are independent Poisson
variables with expectations $\lambda_{AB}$, $\lambda_{A^cB}$, $\lambda_{AB^c}$, $\lambda_{A^cB^c}$, and hence $s$ is a Poisson
variable with expectation $\lambda = \lambda_{AB} + \lambda_{A^cB} + \lambda_{AB^c} + \lambda_{A^cB^c}$.

It may then be of interest to compare the ratio $\lambda_{AB}/\lambda_{A^cB}$ with $\lambda_{AB^c}/\lambda_{A^cB^c}$
and in particular to test the hypothesis $H : \lambda_{AB}/\lambda_{A^cB} \le \lambda_{AB^c}/\lambda_{A^cB^c}$. The joint
distribution of $X, X', Y, Y'$ constitutes a four-parameter exponential family, which
can be written as

$$
P(X = x, X' = x', Y = y, Y' = y') = \frac{e^{-\lambda}}{x!\,x'!\,y!\,y'!} \exp\left\{ x \log\left(\frac{\lambda_{AB}\lambda_{A^cB^c}}{\lambda_{AB^c}\lambda_{A^cB}}\right) + (x' + x)\log\lambda_{A^cB} + (y + x)\log\lambda_{AB^c} + (y' - x)\log\lambda_{A^cB^c} \right\}.
$$



9 These results were conjectured by Berkson and proved by Neyman in a course on

Thus, UMP unbiased tests exist of the usual one- and two-sided hypotheses
concerning the parameter $\theta = \lambda_{AB}\lambda_{A^cB^c} / \lambda_{A^cB}\lambda_{AB^c}$. These are carried out in terms
of the conditional distribution of $X$ given

$$
X' + X = m, \qquad Y + X = t, \qquad X + X' + Y + Y' = s,
$$

where the last condition follows from the fact that given the first two it is
equivalent to $Y' - X = s - t - m$. By Problem 2.14, the conditional distribution of
$X$, $X'$, $Y$ given $X + X' + Y + Y' = s$ is the multinomial distribution of Section
4.6 with

$$
p_{AB} = \frac{\lambda_{AB}}{\lambda}, \qquad p_{A^cB} = \frac{\lambda_{A^cB}}{\lambda}, \qquad p_{AB^c} = \frac{\lambda_{AB^c}}{\lambda}, \qquad p_{A^cB^c} = \frac{\lambda_{A^cB^c}}{\lambda}.
$$

The tests therefore reduce to those derived in Section 4.6.
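
The reduction of the Poisson model to the multinomial one by conditioning on the total can be confirmed numerically. In the sketch below the function names and the particular expectations are hypothetical; the conditional probability of a table given its total is computed from the Poisson probabilities and compared with the multinomial probability with $p = \lambda_{\cdot}/\lambda$.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P{N = k} for a Poisson variable with expectation lam."""
    return exp(-lam) * lam**k / factorial(k)

def conditional_given_total(counts, lams):
    """P{(X, X', Y, Y') = counts | X + X' + Y + Y' = s} for independent Poisson cells."""
    joint = 1.0
    for k, lam in zip(counts, lams):
        joint *= poisson_pmf(k, lam)
    # the total of independent Poisson variables is Poisson with the summed expectation
    return joint / poisson_pmf(sum(counts), sum(lams))

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given cell counts."""
    value = float(factorial(sum(counts)))
    for k, p in zip(counts, probs):
        value *= p**k / factorial(k)
    return value
```

The factors $e^{-\lambda}$ and $\lambda^s$ cancel in the ratio, which is why only the proportions $\lambda_{AB}/\lambda, \ldots$ survive in the conditional model.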

The three models discussed so far involve different sampling schemes. However, 
frequently the subjects for study are not obtained by any sampling but are the 
only ones readily available to the experimenter. To create a probabilistic basis 
for a test in such situations, suppose that $B$ and $B^c$ are two treatments, either
of which can be assigned to each subject, and that $A$ and $A^c$ denote success or
failure (e.g. survival, relief of pain, etc.). The hypothesis of no difference in the
effectiveness of the two treatments (i.e. independence of $A$ and $B$) can then be
tested by assigning the subjects to the treatments, say $m$ to $B$ and $n$ to $B^c$, at
random, i.e. in such a way that all possible assignments are equally likely. It 
is now this random assignment which takes the place of the sampling process in 
creating a probability model, thus making it possible to calculate significance. 

Under the hypothesis $H$ of no treatment difference, the success or failure of a
subject is independent of the treatment to which it is assigned. If the numbers of
subjects in categories $A$ and $A^c$ are $t$ and $t'$ respectively ($t + t' = s$), the values
of $t$ and $t'$ are therefore fixed, so that we are now dealing with a 2 × 2 table in
which all four margins $t$, $t'$, $m$, $n$ are fixed.

Then any one of the four cell counts $X$, $X'$, $Y$, $Y'$ determines the other three.
Under $H$, the distribution of $Y$ is the hypergeometric distribution derived as the
conditional null distribution of $Y$ given $X + Y = t$ at the end of Section 4.5.
The hypothesis is rejected in favor of the alternative that treatment $B^c$ enhances
success if $Y$ is sufficiently large. Although this is the natural test under the
given circumstances, no optimum property can be claimed for it, since no clear
alternative model to $H$ has been formulated. 10
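
For a small group of subjects the randomization distribution can be obtained by listing all assignments. The sketch below is an illustration with hypothetical names: `outcomes` holds the fixed success indicators of the $s$ subjects under $H$, and every choice of the $n$ subjects receiving $B^c$ is taken to be equally likely.

```python
from collections import Counter
from itertools import combinations
from math import comb

def randomization_distribution(outcomes, n):
    """Distribution of Y = number of successes among the n subjects assigned to B^c,
    obtained by enumerating all equally likely assignments."""
    s = len(outcomes)
    tally = Counter(sum(outcomes[i] for i in group) for group in combinations(range(s), n))
    total = comb(s, n)
    return {y: c / total for y, c in sorted(tally.items())}

def hypergeometric_pmf(m, n, t):
    """comb(m, t - y) * comb(n, y) / comb(m + n, t) for the attainable values of y."""
    return {y: comb(m, t - y) * comb(n, y) / comb(m + n, t)
            for y in range(max(0, t - m), min(n, t) + 1)}
```

With eight subjects of whom three are successes and $n = 4$, the enumeration runs over $\binom{8}{4} = 70$ assignments and reproduces the hypergeometric probabilities with $m = n = 4$, $t = 3$.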

Consider finally the situation in which the subjects are again given rather than 
sampled, but $B$ and $B^c$ are attributes (for example, male or female, smoker or
nonsmoker) which cannot be assigned to the subjects at will. Then there exists
no stochastic basis for answering the question whether observed differences in the
rates $X/M$ and $Y/N$ correspond to differences between $B$ and $B^c$, or whether they
are accidental. An approach to the testing of such hypotheses in a nonstochastic 
setting has been proposed by Freedman and Lane (1982). 


10 The one-sided test is of course UMP against the class of alternatives defined by the
right side of (4.21), but no reasonable assumptions have been proposed that would lead
to this class. For suggestions of a different kind of alternative see Gokhale and Johnson
(1978).

The various models for the 2 × 2 table discussed in Sections 4.6 and 4.7 may
be characterized by indicating which elements are random and which fixed:

(i) All margins and s random (Poisson). 

(ii) All margins are random, s fixed (multinomial sampling). 

(iii) One set of margins random, the other (and then a fortiori s) fixed (binomial 
sampling). 

(iv) All margins fixed. Sampling replaced by random assignment of subjects to 
treatments. 

(v) All aspects fixed; no element of randomness. 

In the first three cases there exist UMP unbiased one- and two-sided tests of the
hypothesis of independence of $A$ and $B$. These tests are carried out by
conditioning on the values of all elements in (i)-(iii) that are random, so that in the
conditional model all margins are fixed. The remaining randomness in the table
can be described by any one of the four cell entries; once it is known, the others
are determined by the margins. Under $H$, such an entry has
the hypergeometric distribution given at the end of Section 4.5.

The models (i)-(iii) have a common feature. The subjects under observation
have been obtained by sampling from a population, and the inference
corresponding to acceptance or rejection of $H$ refers to that population. This is not true in
cases (iv) and (v).

In (iv) the subjects are given, and a probabilistic basis is created by assigning
them at random, $m$ to $B$ and $n$ to $B^c$. Under the hypothesis $H$ of no treatment
difference, the four margins are fixed without any conditioning, and the four 
cell entries are again determined by any one of them, which under H has the 
same hypergeometric distribution as before. The present situation differs from 
the earlier three in that the inference cannot be extended beyond the subjects at 
hand. 11 

The situation (v) is outside the scope of this book, since it contains no basis 
for the type of probability calculations considered here. Problems of this kind are 
however of great importance, since they arise in many observational (as opposed 
to experimental) studies. For a related discussion, see Finch (1979). 


11 For a more detailed treatment of the distinction between population models [such
as (i)-(iii)] and randomization models [such as (iv)], see Lehmann (1998).


4.8 Some Three-Factor Contingency Tables

When an association between $A$ and $B$ exists in a 2 × 2 table, it does not follow
that one of the factors has a causal influence on the other. Instead, the explanation
may, for example, be in the fact that both factors are causally affected by a third
factor $C$. If $C$ has $K$ possible outcomes $C_1, \ldots, C_K$, one may then be faced with
the apparently paradoxical situation (known as Simpson's paradox) that $A$ and
$B$ are independent under each of the conditions $C_k$ ($k = 1, \ldots, K$) but exhibit
positive (or negative) association when the tables are aggregated over $C$, that
is, when the $K$ separate 2 × 2 tables are combined into a single one showing
the total counts of the four categories. [An interesting example is discussed in
Agresti (2002).] In order to determine whether the association of $A$ and $B$ in
the aggregated table is indeed "spurious", one would test the hypothesis (which
arises also in other contexts) that $A$ and $B$ are conditionally independent given
$C_k$ for all $k = 1, \ldots, K$, against the alternative that there is an association for at
least some $k$.

Let $X_k$, $X'_k$, $Y_k$, $Y'_k$ denote the counts in the $4K$ cells of the 2 × 2 × K table
which extends the 2 × 2 table of Section 4.6 to the present case.

Again, several sampling schemes are possible. Consider first a random sample 
of size s from the population at large. The joint distribution of the 4 K cell 
counts then is multinomial with probabilities $p_{ABC_k}$, $p_{A^cBC_k}$, $p_{AB^cC_k}$, $p_{A^cB^cC_k}$ for
the outcomes indicated by the subscripts. If $\Delta_k$ denotes the $AB$ odds ratio for
$C_k$ defined by

$$
\Delta_k = \frac{p_{A^cBC_k}\, p_{AB^cC_k}}{p_{ABC_k}\, p_{A^cB^cC_k}} = \frac{p_{A^cB \mid C_k}\, p_{AB^c \mid C_k}}{p_{AB \mid C_k}\, p_{A^cB^c \mid C_k}},
$$

where $p_{AB \mid C_k}, \ldots$ denotes the conditional probability of the indicated event given
$C_k$, then the hypothesis to be tested is $\Delta_k = 1$ for all $k$.

A second scheme takes samples of size $s_k$ from $C_k$ and classifies the subjects
as $AB$, $A^cB$, $AB^c$ or $A^cB^c$. This is the case of $K$ independent 2 × 2 tables, in which
one is dealing with $K$ quadrinomial distributions of the kind considered in the
preceding sections. Since the $k$th of these distributions is also that of the same
four outcomes in the first model conditionally given $C_k$, we shall denote the
probabilities of these outcomes in the present model again by $p_{AB \mid C_k}, \ldots$.

To motivate the next sampling scheme, suppose that $A$ and $A^c$ represent success
or failure of a medical treatment, $B^c$ and $B$ that the treatment is applied or the
subject is used as a control, and $C_k$ the $k$th hospital taking part in this study. If
samples of size $n_k$ and $m_k$ are obtained and are assigned to treatment and control
respectively, we are dealing with $K$ pairs of binomial distributions. Letting $Y_k$
and $X_k$ denote the number of successes obtained by the treatment subjects and
controls in the $k$th hospital, the joint distribution of these variables by Section
4.5 is

$$
\left[\prod_{k=1}^{K} \binom{m_k}{x_k}\binom{n_k}{y_k} q_{1k}^{m_k} q_{2k}^{n_k}\right] \exp\left[ \sum_{k} y_k \log\Delta_k + \sum_{k} (x_k + y_k)\log\frac{p_{1k}}{q_{1k}} \right],
$$

where $p_{1k}$ and $q_{1k}$ ($p_{2k}$ and $q_{2k}$) denote the probabilities of success and failure
under $B$ (under $B^c$).

The above three sampling schemes lead to 2 × 2 × K tables in which respectively
none, one, or two of the margins are fixed. Alternatively, in some situations
a model may be appropriate in which the $4K$ variables $X_k$, $X'_k$, $Y_k$, $Y'_k$ are
independent Poisson with expectations $\lambda_{ABC_k}, \ldots$. In this case, the total sample
size $s$ is also random.

For a test of the hypothesis of conditional independence of $A$ and $B$ given $C_k$
for all $k$ (i.e. that $\Delta_1 = \cdots = \Delta_K = 1$), see Problem 12.65. Here we shall consider
the problem under the simplifying assumption that the $\Delta_k$ have a common value
$\Delta$, so that the hypothesis reduces to $H : \Delta = 1$. Applying Theorem 4.4.1 to the
third model ($K$ pairs of binomials) and assuming the alternatives to be $\Delta > 1$,
we see that a UMP unbiased test exists and rejects $H$ when $\sum Y_k > C(X_1 + Y_1, \ldots, X_K + Y_K)$,
where $C$ is determined so that the conditional probability of
rejection, given that $X_k + Y_k = t_k$, is $\alpha$ for all $k = 1, \ldots, K$. It follows from
Section 4.5 that the conditional joint distribution of the $Y_k$ under $H$ is

$$
P_H[Y_1 = y_1, \ldots, Y_K = y_K \mid X_k + Y_k = t_k,\ k = 1, \ldots, K] = \prod_{k=1}^{K} \frac{\binom{m_k}{t_k - y_k}\binom{n_k}{y_k}}{\binom{m_k + n_k}{t_k}}.
$$

The conditional distribution of $\sum Y_k$ can now be obtained by adding the
probabilities over all $(y_1, \ldots, y_K)$ whose sum has a given value. Unless the numbers are
very small, this is impractical and approximations must be used [see Cox (1966)
and Gart (1970)].
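
The summation over all $(y_1, \ldots, y_K)$ with a given sum need not be carried out by listing every combination: since the $Y_k$ are conditionally independent under $H$, the distribution of $\sum Y_k$ is the convolution of the $K$ hypergeometric distributions. The following sketch, with hypothetical function names and with each stratum described by an assumed triple $(m_k, n_k, t_k)$, builds the exact null distribution one stratum at a time; it is meant for tables of moderate size and is not a substitute for the approximations cited above.

```python
from math import comb

def hypergeom_pmf(m, n, t):
    """P{Y = y | X + Y = t} under Delta = 1, as a list indexed by y = 0..t."""
    return [comb(m, t - y) * comb(n, y) / comb(m + n, t) for y in range(t + 1)]

def null_pmf_of_sum(strata):
    """Exact conditional null distribution of sum_k Y_k; strata is a list of (m_k, n_k, t_k)."""
    dist = [1.0]                                   # distribution of an empty sum
    for m, n, t in strata:
        pk = hypergeom_pmf(m, n, t)
        new = [0.0] * (len(dist) + len(pk) - 1)
        for i, a in enumerate(dist):               # discrete convolution
            for j, b in enumerate(pk):
                new[i + j] += a * b
        dist = new
    return dist

def upper_tail_p_value(strata, observed_sum):
    """P{sum_k Y_k >= observed_sum | all t_k}, for the one-sided alternatives Delta > 1."""
    return sum(null_pmf_of_sum(strata)[observed_sum:])
```

The work grows with the product of the number of strata and the square of the attainable range of the sum, rather than with the number of tables.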

The assumption $H' : \Delta_1 = \cdots = \Delta_K = \Delta$ has a simple interpretation when
the successes and failures of the binomial trials are obtained by dichotomizing
underlying unobservable continuous response variables. In a single such trial,
suppose the underlying variable is $Z$ and that success occurs when $Z > 0$ and
failure when $Z \le 0$. If $Z$ is distributed as $F(z - \zeta)$ with location parameter $\zeta$,
we have $p = 1 - F(-\zeta)$ and $q = F(-\zeta)$. Of particular interest is the logistic
distribution, for which $F(x) = 1/(1 + e^{-x})$. In this case $p = e^{\zeta}/(1 + e^{\zeta})$,
$q = 1/(1 + e^{\zeta})$, and hence $\log(p/q) = \zeta$. Applying this fact to the success probabilities

$$
p_{1k} = 1 - F(-\zeta_{1k}), \qquad p_{2k} = 1 - F(-\zeta_{2k}),
$$

we find that

$$
\theta_k = \log\Delta_k = \log\left(\frac{p_{2k}}{q_{2k}} \Big/ \frac{p_{1k}}{q_{1k}}\right) = \zeta_{2k} - \zeta_{1k},
$$

so that $\zeta_{2k} = \zeta_{1k} + \theta_k$. In this model, $H'$ thus reduces to the assumption that
$\zeta_{2k} = \zeta_{1k} + \theta$, that is, that the treatment shifts the distribution of the underlying
response by a constant amount $\theta$.
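
The identity $\log\Delta_k = \zeta_{2k} - \zeta_{1k}$ for the logistic response model is easily confirmed numerically. In the sketch below the function names are hypothetical; the success probabilities are computed from the location parameters exactly as in the text.

```python
from math import exp, log

def logistic_cdf(x):
    """F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + exp(-x))

def success_probability(zeta):
    """p = P{Z > 0} = 1 - F(-zeta) when Z has distribution F(z - zeta)."""
    return 1.0 - logistic_cdf(-zeta)

def log_odds(p):
    """log(p / q) with q = 1 - p."""
    return log(p / (1.0 - p))

def log_odds_ratio(zeta1, zeta2):
    """theta = log Delta for success probabilities determined by zeta1 and zeta2."""
    return log_odds(success_probability(zeta2)) - log_odds(success_probability(zeta1))
```

Whatever the baseline $\zeta_{1k}$ of a stratum may be, a shift by $\theta$ yields the same value of $\log\Delta_k$, which is the content of $H'$ in this model.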

If it is assumed that $F$ is normal rather than logistic, $F(x) = \Phi(x)$ say, then
$\zeta = \Phi^{-1}(p)$, and constancy of $\zeta_{2k} - \zeta_{1k}$ requires the much more cumbersome
condition $\Phi^{-1}(p_{2k}) - \Phi^{-1}(p_{1k}) = \text{constant}$. However, the functions $\log(p/q)$ and
$\Phi^{-1}(p)$ agree quite well in the range $.1 \le p \le .9$ [see Cox (1970, p. 28)], and
the assumption of constant $\Delta_k$ in the logistic response model is therefore close
to the corresponding assumption for an underlying normal response. 12 [The
so-called loglinear models, which for contingency tables correspond to the linear
models to be considered in Chapter 7 but with a logistic rather than a normal
response variable, provide the most widely used approach to contingency tables.
See, for example, the books by Cox (1970), Haberman (1974), Bishop, Fienberg,
and Holland (1975), Fienberg (1980), Plackett (1981), and Agresti (2002).]
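
The closeness of $\log(p/q)$ and $\Phi^{-1}(p)$ on $.1 \le p \le .9$ is to be understood up to a scale factor, and a rough check is to see how little the ratio of the two functions varies over that range. The sketch below is an assumption-laden illustration: the grid, the function names, and the use of Python's `statistics.NormalDist` for $\Phi^{-1}$ are choices made here, not part of the text.

```python
from math import log
from statistics import NormalDist

def logit(p):
    """log(p / q)."""
    return log(p / (1.0 - p))

def probit(p):
    """The standard normal quantile Phi^{-1}(p)."""
    return NormalDist().inv_cdf(p)

def ratio_range(lo=0.10, hi=0.90, steps=80):
    """Smallest and largest value of logit(p) / probit(p) on an equally spaced grid;
    p = 1/2 is skipped because both functions vanish there."""
    ratios = []
    for i in range(steps + 1):
        p = lo + (hi - lo) * i / steps
        if abs(p - 0.5) < 1e-9:
            continue
        ratios.append(logit(p) / probit(p))
    return min(ratios), max(ratios)
```

On this grid the ratio stays between roughly 1.60 and 1.71, so that a constant difference on one scale corresponds to a nearly constant difference on the other.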

The UMP unbiased test, derived above for the case that the $B$- and $C$-margins
are fixed, applies equally when any two margins, any one margin, or no
margins are fixed, with the understanding that in all cases the test is carried out
conditionally, given the values of all random margins.


12 The problem of discriminating between a logistic and normal response model is
discussed by Chambers and Cox (1967).

The test is also used (but no longer UMP unbiased) for testing $H : \Delta_1 = \cdots =
\Delta_K = 1$ when the $\Delta$'s are not assumed to be equal but when the $\Delta_k - 1$ can be
assumed to have the same sign, so that the departure from independence is in the
same direction for all the 2 × 2 tables. A one- or two-sided version is appropriate
as the alternatives do or do not specify the direction. For a discussion of this test,
the Cochran-Mantel-Haenszel test, and some of its extensions see Agresti (2002,
Section 7.4).

Consider now the case $K = 2$, with $m_k$ and $n_k$ fixed, and the problem of
testing $H' : \Delta_2 = \Delta_1$ rather than assuming it. The joint distribution of the $X$'s
and $Y$'s given earlier can then be written as

$$
\left[\prod_{k=1}^{2} \binom{m_k}{x_k}\binom{n_k}{y_k} q_{1k}^{m_k} q_{2k}^{n_k}\right] \exp\left( y_2 \log\frac{\Delta_2}{\Delta_1} + (y_1 + y_2)\log\Delta_1 + \sum_{i=1}^{2} (x_i + y_i)\log\frac{p_{1i}}{q_{1i}} \right),
$$

and $H'$ is rejected in favor of $\Delta_2 > \Delta_1$ if $Y_2 > C$, where $C$ depends on $Y_1 + Y_2$,
$X_1 + Y_1$ and $X_2 + Y_2$, and is determined so that the conditional probability of
rejection given $Y_1 + Y_2 = w$, $X_1 + Y_1 = t_1$, $X_2 + Y_2 = t_2$ is $\alpha$. The conditional
null distribution of $Y_1$ and $Y_2$, given $X_k + Y_k = t_k$ ($k = 1, 2$), by (4.21) with $\Delta$
in place of $\rho$ is

$$
C_{t_1}(\Delta)\, C_{t_2}(\Delta) \binom{m_1}{t_1 - y_1}\binom{n_1}{y_1}\binom{m_2}{t_2 - y_2}\binom{n_2}{y_2} \Delta^{y_1 + y_2},
$$

and hence the conditional distribution of $Y_2$, given in addition that $Y_1 + Y_2 = w$,
is of the form

$$
k(t_1, t_2, w) \binom{m_1}{y + t_1 - w}\binom{n_1}{w - y}\binom{m_2}{t_2 - y}\binom{n_2}{y}.
$$

Some approximations to the critical value of this test are discussed by Birch
(1964); see also Venable and Bhapkar (1978). [Optimum large-sample tests of
some other hypotheses in 2 × 2 × 2 tables are obtained by Cohen, Gatsonis, and
Marden (1983).]
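
For small tables the conditional null distribution of $Y_2$ can be tabulated exactly, the constant $k(t_1, t_2, w)$ being obtained by normalization. The following is a minimal sketch with a hypothetical function name; it assumes the margins are compatible, so that at least one value of $y$ has positive weight.

```python
from math import comb

def conditional_pmf_y2(m1, n1, m2, n2, t1, t2, w):
    """P{Y_2 = y | X_1 + Y_1 = t1, X_2 + Y_2 = t2, Y_1 + Y_2 = w} under Delta_1 = Delta_2.
    The common value Delta cancels, since Delta**(y1 + y2) = Delta**w is constant."""
    # y1 = w - y must lie in 0..t1 and y in 0..t2, which fixes the range of y
    ys = range(max(0, w - t1), min(t2, w) + 1)
    weights = {y: comb(m1, y + t1 - w) * comb(n1, w - y) * comb(m2, t2 - y) * comb(n2, y)
               for y in ys}
    total = sum(weights.values())      # 1 / k(t1, t2, w)
    return {y: v / total for y, v in weights.items()}
```

The critical value $C$ is then the appropriate upper quantile of this distribution, with randomization on the boundary if the exact level $\alpha$ is required.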


4.9 The Sign Test 

To test consumer preferences between two products, a sample of n subjects are 
asked to state their preferences. Each subject is recorded as plus or minus as 
it favors product B or A. The total number Y of plus signs is then a binomial 
variable with distribution b(p, n ). Consider the problem of testing the hypothesis 
$p = \tfrac{1}{2}$ of no difference against the alternatives $p \ne \tfrac{1}{2}$. (As in previous such
problems, we disregard here that in case of rejection it will be necessary to decide
which of the two products is preferred.) The appropriate test is the two-sided sign
test, which rejects when $|Y - \tfrac{1}{2}n|$ is too large. This is UMP unbiased (Section 4.2).
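
In practice the rejection region of the two-sided sign test is usually taken to be symmetric, $Y \le k$ or $Y \ge n - k$. The sketch below (hypothetical function names) finds the largest such $k$ for which the nonrandomized test has level at most $\alpha$; the UMP unbiased test of the text would in addition randomize on the boundary to attain the level exactly.

```python
from math import comb

def sign_test_cutoff(n, alpha):
    """Largest k with P{Y <= k} + P{Y >= n - k} <= alpha when p = 1/2; -1 if no such k."""
    best = -1
    tail = 0                                   # number of outcomes with Y <= k
    for k in range((n - 1) // 2 + 1):          # k < n - k, so the two tails are disjoint
        tail += comb(n, k)
        if 2 * tail <= alpha * 2**n:
            best = k
        else:
            break
    return best

def attained_level(n, k):
    """Size of the test that rejects when Y <= k or Y >= n - k (requires 2k < n)."""
    return 2 * sum(comb(n, y) for y in range(k + 1)) / 2**n
```

For $n = 20$ and $\alpha = .05$ this gives $k = 5$, that is, rejection when $|Y - 10| \ge 5$, with attained level $2 \cdot 21700/2^{20} \approx .041$.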

Sometimes the subjects are also given the possibility of declaring themselves
as undecided. If $p_-$, $p_+$, and $p_0$ denote the probabilities of preference for product
$A$, product $B$, and of no preference respectively, the numbers $X$, $Y$, and $Z$ of
decisions in favor of these three possibilities are distributed according to the
multinomial distribution

$$
\frac{n!}{x!\,y!\,z!}\, p_-^{x}\, p_+^{y}\, p_0^{z} \qquad (x + y + z = n), \tag{4.22}
$$

and the hypothesis to be tested is $H : p_+ = p_-$. The distribution (4.22) can also
be written as

$$
\frac{n!}{x!\,y!\,z!} \left(\frac{p_+}{1 - p_0 - p_+}\right)^{y} \left(\frac{p_0}{1 - p_0 - p_+}\right)^{z} (1 - p_0 - p_+)^{n}, \tag{4.23}
$$

and is then seen to constitute an exponential family with $U = Y$, $T = Z$,
$\theta = \log[p_+/(1 - p_0 - p_+)]$, $\vartheta = \log[p_0/(1 - p_0 - p_+)]$. Rewriting the hypothesis $H$
as $p_+ = 1 - p_0 - p_+$ it is seen to be equivalent to $\theta = 0$. There exists therefore
a UMP unbiased test of $H$, which is obtained by considering $z$ as fixed and
determining the best unbiased conditional test of $H$ given $Z = z$. Since the
conditional distribution of $Y$ given $z$ is a binomial distribution $b(p, n - z)$ with
$p = p_+/(p_+ + p_-)$, the problem reduces to that of testing the hypothesis
$p = \tfrac{1}{2}$ in a binomial distribution with $n - z$ trials, for which the rejection region
is $|Y - \tfrac{1}{2}(n - z)| > C(z)$. The UMP unbiased test is therefore obtained by
disregarding the number of cases in which no preference is expressed (the number
of ties), and applying the sign test to the remaining data.
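
The prescription of discarding the ties can be sketched as a significance probability computed from the conditional binomial distribution with $n - z$ trials. The function name is hypothetical, and the quantity returned is the two-sided tail probability of the nonrandomized conditional test, not the randomized UMP unbiased test itself.

```python
from math import comb

def conditional_sign_test_p_value(n_minus, n_plus, n_ties):
    """Two-sided tail probability of the sign test applied after discarding ties.
    n_ties is accepted only to make explicit that it does not enter the computation."""
    trials = n_minus + n_plus                  # n - z
    deviation = abs(n_plus - trials / 2)       # observed |Y - (n - z)/2|
    extreme = sum(comb(trials, y) for y in range(trials + 1)
                  if abs(y - trials / 2) >= deviation)
    return extreme / 2**trials
```

With 5 minus signs, 15 plus signs and any number of ties, the value is $2 \cdot 21700/2^{20} \approx .041$; the ties affect the result only through the reduced number of trials.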

The power of the test depends strongly on $p_0$, which governs the distribution of
$Z$. For large $p_0$, the number $n - z$ of trials in the conditional binomial distribution
can be expected to be small, and the test will thus have little power. This may be
an advantage in the present case, since a sufficiently high value of $p_0$, regardless
of the value of $p_+/p_-$, implies that the population as a whole is largely indifferent
with respect to the products.

The above conditional sign test applies to any situation in which the
observations are the result of $n$ independent trials, each of which is either a success
(+), a failure (−), or a tie. As an alternative treatment of ties, it is sometimes
proposed to assign each tie at random (with probability $\tfrac{1}{2}$ each) to either plus or
minus. The total number $Y'$ of plus signs after the ties have been broken is then a
binomial variable with distribution $b(\pi, n)$, where $\pi = p_+ + \tfrac{1}{2}p_0$. The hypothesis
$H$ becomes $\pi = \tfrac{1}{2}$, and is rejected when $|Y' - \tfrac{1}{2}n| > C$, where the probability
of rejection is $\alpha$ when $\pi = \tfrac{1}{2}$. This test can be viewed also as a randomized test
based on $X$, $Y$, and $Z$, and it is unbiased for testing $H$ in its original form, since
$p_+$ is $=$ or $\ne p_-$ as $\pi$ is $=$ or $\ne \tfrac{1}{2}$. Since the test involves randomization other
than on the boundaries of the rejection region, it is less powerful than the UMP
unbiased test for this situation, so that the random breaking of ties results in a
loss of power.

This remark might be thought to throw some light on the question of whether 
in the determination of consumer preferences it is better to permit the subject 
to remain undecided or to force an expression of preference. However, here the 
assumption of a completely random assignment in case of a tie does not apply. 
Even when the subject is not conscious of a definite preference, there will usually 
be a slight inclination toward one of the two possibilities, which in a majority 
of the cases will be brought out by a forced decision. This will be balanced in 
part by the fact that such forced decisions are more variable than those reached
voluntarily. Which of these two factors dominates depends on the strength of the
preference.

Frequently, the question of preference arises between a standard product and 
a possible modification or a new product. If each subject is required to express a 
definite preference, the hypothesis of interest is usually the one-sided hypothesis
$p_+ \le p_-$, where + denotes a preference for the modification. However, if an
expression of indifference is permitted the hypothesis to be tested is not $p_+ \le p_-$
but rather $p_+ \le p_0 + p_-$, since typically the modification is of interest only if
it is actually preferred. As was shown in Example 3.8.1, the one-sided sign test
which rejects when the number of plus signs is too large is UMP for this problem.

In some investigations, the subject is asked not only to express a preference 
but to give a more detailed evaluation, such as a score on some numerical scale. 
Depending on the situation, the hypothesis can then take on one of two forms. One 
may be interested in the hypothesis that there is no difference in the consumer’s 
reaction to the two products. Formally, this states that the distribution of the 
scores $X_1, \ldots, X_n$ expressing the degree of preference of the $n$ subjects for the
modified product is symmetric about the origin. This problem, for which a UMP 
unbiased test does not exist without further assumptions, will be considered in 
Section 6.10. 

Alternatively, the hypothesis of interest may continue to be $H : p_+ = p_-$. Since
$p_- = P\{X < 0\}$ and $p_+ = P\{X > 0\}$, this now becomes

$$
H : P\{X > 0\} = P\{X < 0\}.
$$

Here symmetry of $X$ is no longer assumed even when $P\{X < 0\} = P\{X > 0\}$. If
no assumptions are made concerning the distribution of $X$ beyond the fact that
the set of its possible values is given, the sign test based on the number of $X$'s
that are positive and negative continues to be UMP unbiased.

To see this, note that any distribution of $X$ can be specified by the probabilities 

$$p_- = P\{X < 0\}, \quad p_+ = P\{X > 0\}, \quad p_0 = P\{X = 0\},$$

and the conditional distributions $F_-$ and $F_+$ of $X$ given $X < 0$ and $X > 0$ 
respectively. Consider any fixed distributions $F'_-$, $F'_+$, and denote by $\mathcal{F}_0$ the 
family of all distributions with $F_- = F'_-$, $F_+ = F'_+$ and arbitrary $p_-$, $p_+$, $p_0$. 
Any test that is unbiased for testing $H$ in the original family of distributions $\mathcal{F}$ 
in which $F_-$ and $F_+$ are unknown is also unbiased for testing $H$ in the smaller 
family $\mathcal{F}_0$. We shall show below that there exists a UMP unbiased test $\phi_0$ of $H$ 
in $\mathcal{F}_0$. It turns out that $\phi_0$ is also unbiased for testing $H$ in $\mathcal{F}$ and is independent 
of $F'_-$, $F'_+$. Let $\phi$ be any other unbiased test of $H$ in $\mathcal{F}$, and consider any fixed 
alternative, which without loss of generality can be assumed to be in $\mathcal{F}_0$. Since 
$\phi$ is unbiased for $\mathcal{F}$, it is unbiased for testing $p_+ = p_-$ in $\mathcal{F}_0$; the power of $\phi_0$ 
against the particular alternative is therefore at least as good as that of $\phi$. Hence 
$\phi_0$ is UMP unbiased. 

To determine the UMP unbiased test of $H$ in $\mathcal{F}_0$, let the densities of $F'_-$ and 
$F'_+$ with respect to some measure $\mu$ be $f'_-$ and $f'_+$. The joint density of the $X$’s 
at a point $(x_1,\ldots,x_n)$ with 

$$x_{i_1},\ldots,x_{i_r} < 0 = x_{j_1} = \cdots = x_{j_s} < x_{k_1},\ldots,x_{k_m}$$

is 

$$p_-^{r}\,p_0^{s}\,p_+^{m}\,f'_-(x_{i_1})\cdots f'_-(x_{i_r})\,f'_+(x_{k_1})\cdots f'_+(x_{k_m}).$$

The set of statistics $(r, s, m)$ is sufficient for $(p_-, p_0, p_+)$, and its distribution is 
given by (4.22) with $x = r$, $y = m$, $z = s$. The sign test is therefore seen to be 
UMP unbiased as before. 

A different application of the sign test arises in the context of a 2 x 2 table 
for matched pairs. In Section 4.5, success probabilities for two treatments were 
compared on the basis of two independent random samples. Unless the population 
of subjects from which these samples are drawn is fairly homogeneous, a more 
powerful test can often be obtained by using a sample of matched pairs (for 
example, twins or the same subject given the treatments at different times). For 
each pair there are then four possible outcomes: (0,0), (0,1), (1,0), and (1,1), 
where 1 and 0 stand for success and failure, and the first and second number in 
each pair of responses refer to the subject receiving treatment 1 or 2 respectively. 

The results of such a study are sometimes displayed in a 2 × 2 table, 

                  1st 
               0       1 
    2nd   0    X       X' 
          1    Y       Y' 

which despite the formal similarity differs from that considered in Section 4.6. 
If a sample of $s$ pairs is drawn, the joint distribution of $X$, $Y$, $X'$, $Y'$ as before 
is multinomial, with probabilities $p_{00}$, $p_{01}$, $p_{10}$, $p_{11}$. The success probabilities of 
the two treatments are $\pi_1 = p_{10} + p_{11}$ for the first and $\pi_2 = p_{01} + p_{11}$ for the 
second treatment, and the hypothesis to be tested is $H : \pi_1 = \pi_2$ or equivalently 
$p_{10} = p_{01}$ rather than $p_{10}p_{01} = p_{00}p_{11}$ as it was earlier. 

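In exponential form, the joint distribution can be written as 

$$\frac{s!}{x!\,x'!\,y!\,y'!}\;p_{11}^{s}\exp\left[y\log\frac{p_{01}}{p_{10}} + (x'+y)\log\frac{p_{10}}{p_{11}} + x\log\frac{p_{00}}{p_{11}}\right]. \tag{4.24}$$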


There exists a UMP unbiased test, McNemar’s test, which rejects $H$ in favor 
of the alternatives $p_{10} < p_{01}$ when $Y > C(X' + Y, X)$, where the conditional 
probability of rejection given $X' + Y = d$ and $X = x$ is $\alpha$ for all $d$ and $x$. Under 
this condition, the numbers of pairs (0, 0) and (1, 1) are fixed, and the only 
remaining variables are $Y$ and $X' = d - Y$, which specify the division of the $d$ 
cases with mixed response between the outcomes (0, 1) and (1, 0). Conditionally, 
one is dealing with $d$ binomial trials with success probability $p = p_{01}/(p_{01} + p_{10})$, 
$H$ becomes $p = 1/2$, and the UMP unbiased test reduces to the sign test. [The 
issue of conditional versus unconditional power for this test is discussed by Frisén 
(1980).] 
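
The conditional reduction just described can be illustrated numerically. The following is a minimal sketch, not part of the original text: the function names `upper_tail` and `mcnemar_one_sided` are hypothetical, and the code assumes only what is stated above, namely that given $d$ discordant pairs the count $Y$ is binomial with $p = 1/2$ under $H$. It finds the cutoff $C$ and the randomization probability $\gamma$ that make the conditional rejection probability exactly $\alpha$.

```python
from math import comb


def upper_tail(k: int, d: int) -> float:
    """P{B >= k} for B ~ binomial(d, 1/2)."""
    return sum(comb(d, j) for j in range(max(k, 0), d + 1)) / 2**d


def mcnemar_one_sided(y: int, x_prime: int, alpha: float = 0.05):
    """Conditional one-sided sign test of H: p10 = p01 against p10 < p01.

    y       -- number of (0, 1) pairs
    x_prime -- number of (1, 0) pairs
    Returns (C, gamma, phi): reject when Y > C, reject with probability
    gamma when Y = C; phi is the rejection probability at the observed y.
    """
    d = y + x_prime
    c = d
    # Lower c until P{Y >= c} exceeds alpha; then P{Y > c} <= alpha < P{Y >= c}.
    while upper_tail(c, d) <= alpha:
        c -= 1
    p_above = upper_tail(c + 1, d)
    p_at = upper_tail(c, d) - p_above
    gamma = (alpha - p_above) / p_at
    phi = 1.0 if y > c else (gamma if y == c else 0.0)
    return c, gamma, phi


# Illustrative data (made up): 9 pairs of type (0, 1) and 1 pair of type (1, 0).
c, gamma, phi = mcnemar_one_sided(y=9, x_prime=1, alpha=0.05)
```

The pairs of type (0, 0) and (1, 1) do not enter the computation at all, in line with the remark that ties are disregarded.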

The situation is completely analogous to that of the sign test in the presence 
of undecided opinions, with the only difference that there are now two types of 
ties, (0, 0) and (1, 1), both of which are disregarded in performing the test. 

4.10 Problems 

Section 4.1 

Problem 4.1 Admissibility. Any UMP unbiased test $\phi_0$ is admissible in the 
sense that there cannot exist another test $\phi_1$ which is at least as powerful as $\phi_0$ 
against all alternatives and more powerful against some. 

[If $\phi$ is unbiased and $\phi'$ is uniformly at least as powerful as $\phi$, then $\phi'$ is also 
unbiased.] 

Problem 4.2 p-values. Consider a family of tests of $H : \theta = \theta_0$ (or $\theta \le \theta_0$), with 
level-$\alpha$ rejection regions $S_\alpha$, such that (a) $P_{\theta_0}\{X \in S_\alpha\} = \alpha$ for all $0 < \alpha < 1$, and 
(b) $S_\alpha \subset S_{\alpha'}$ for $\alpha < \alpha'$. If the tests $S_\alpha$ are unbiased, the distribution of the 
p-value $\hat\alpha$ under any alternative $\theta$ satisfies 

$$P_\theta\{\hat\alpha \le \alpha\} \ge P_{\theta_0}\{\hat\alpha \le \alpha\} = \alpha$$

so that it is shifted toward the origin. 


Section 4.2 

Problem 4.3 Let $X$ have the binomial distribution $b(p,n)$, and consider the 
hypothesis $H : p = p_0$ at level of significance $\alpha$. Determine the boundary values 
of the UMP unbiased test for $n = 10$ with $\alpha = .1$, $p_0 = .2$ and with $\alpha = .05$, 
$p_0 = .4$, and in each case graph the power functions of both the unbiased and the 
equal-tails test. 
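
One way to explore this problem numerically is sketched below. It is an illustration under stated assumptions rather than a solution from the text: the helper names are hypothetical, and the search assumes the two-sided test rejects for $x < C_1$ or $x > C_2$, randomizes with probabilities $\gamma_1, \gamma_2$ at $C_1 < C_2$, and is pinned down by the size condition $E_{p_0}\phi(X) = \alpha$ together with $E_{p_0}[X\phi(X)] = \alpha n p_0$. The degenerate case $C_1 = C_2$ is not handled.

```python
from math import comb


def pmf(x, n, p):
    """Binomial probability P{X = x} for X ~ b(p, n)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def umpu_two_sided(n, p0, alpha, tol=1e-12):
    """Search for (C1, C2, gamma1, gamma2) satisfying both side conditions."""
    f = [pmf(x, n, p0) for x in range(n + 1)]
    for c1 in range(n + 1):
        for c2 in range(c1 + 1, n + 1):
            # Mass and first moment of the interior of the acceptance region.
            s0 = sum(f[x] for x in range(c1 + 1, c2))
            s1 = sum(x * f[x] for x in range(c1 + 1, c2))
            r0 = (1 - alpha) - s0
            r1 = (1 - alpha) * n * p0 - s1
            # Solve a1*f[c1] + a2*f[c2] = r0 and a1*c1*f[c1] + a2*c2*f[c2] = r1,
            # where a_i = 1 - gamma_i is the acceptance probability at C_i.
            a1 = (c2 * r0 - r1) / ((c2 - c1) * f[c1])
            a2 = (r1 - c1 * r0) / ((c2 - c1) * f[c2])
            if -tol <= a1 <= 1 + tol and -tol <= a2 <= 1 + tol:
                g1 = min(1.0, max(0.0, 1 - a1))
                g2 = min(1.0, max(0.0, 1 - a2))
                return c1, c2, g1, g2
    return None


def power(p, n, test):
    """Power function of the randomized two-sided test at success probability p."""
    c1, c2, g1, g2 = test
    total = 0.0
    for x in range(n + 1):
        if x < c1 or x > c2:
            phi = 1.0
        elif x == c1:
            phi = g1
        elif x == c2:
            phi = g2
        else:
            phi = 0.0
        total += phi * pmf(x, n, p)
    return total


test_a = umpu_two_sided(10, 0.2, 0.1)
test_b = umpu_two_sided(10, 0.4, 0.05)
```

Evaluating `power(p, 10, test_a)` on a grid of `p` values gives the points needed for the requested graph; the equal-tails test would have to be constructed separately.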


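Problem 4.4 Let $X$ have the Poisson distribution $P(\tau)$, and consider the 
hypothesis $H : \tau = \tau_0$. Then condition (4.6) reduces to 

$$\sum_{x=C_1+1}^{C_2-1}\frac{\tau_0^{x-1}}{(x-1)!}e^{-\tau_0} + \sum_{i=1}^{2}(1-\gamma_i)\frac{\tau_0^{C_i-1}}{(C_i-1)!}e^{-\tau_0} = 1 - \alpha,$$

provided $C_1 > 1$. 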


Problem 4.5 Let $T_n/\theta$ have a $\chi^2$-distribution with $n$ degrees of freedom. For 
testing $H : \theta = 1$ at level of significance $\alpha = .05$, find $n$ so large that the power 
of the UMP unbiased test is $\ge .9$ against both $\theta \ge 2$ and $\theta \le \frac{1}{2}$. How large does 
$n$ have to be if the test is not required to be unbiased? 


Problem 4.6 Suppose $X$ has density (with respect to some measure $\mu$) 

$$p_\theta(x) = C(\theta)\exp[\theta T(x)]h(x),$$

for some real-valued $\theta$. Assume the distribution of $T(X)$ is continuous under $\theta$ 
(for any $\theta$). Consider the problem of testing $\theta = \theta_0$ versus $\theta \ne \theta_0$. If the null 
hypothesis is rejected, then a decision is to be made as to whether $\theta > \theta_0$ or 
$\theta < \theta_0$. We say that a Type 3 (or directional) error is made when it is declared 
that $\theta > \theta_0$ when in fact $\theta < \theta_0$ (or vice-versa). Consider a level $\alpha$ test that 
rejects the null hypothesis if $T < C_1$ or $T > C_2$ for constants $C_1 < C_2$. Further 
suppose that it is declared that $\theta < \theta_0$ if $T < C_1$ and $\theta > \theta_0$ if $T > C_2$. 

(i) If the constants are chosen so that the test is UMPU, show that the Type 3 
error is controlled in the sense that 

$$\sup_{\theta \ne \theta_0} P_\theta\{\text{Type 3 error is made}\} \le \alpha. \tag{4.25}$$

(ii) If the constants are chosen so that the test is equi-tailed in the sense 

$$P_{\theta_0}\{T(X) < C_1\} = P_{\theta_0}\{T(X) > C_2\} = \alpha/2,$$

then show (4.25) holds with $\alpha$ replaced by $\alpha/2$. 

(iii) Give an example where the UMPU level $\alpha$ test has the left side of (4.25) 
strictly $> \alpha/2$. [Confidence intervals for $\theta$ after rejection of a two-sided test are 
discussed in Finner (1994).] 

Problem 4.7 Let $X$ and $Y$ be independently distributed according to one-parameter 
exponential families, so that their joint distribution is given by 

$$dP_{\theta_1,\theta_2}(x,y) = C(\theta_1)e^{\theta_1 T(x)}\,d\mu(x)\,K(\theta_2)e^{\theta_2 U(y)}\,d\nu(y).$$

Suppose that with probability 1 the statistics $T$ and $U$ each take on at least three 
values and that $(a, b)$ is an interior point of the natural parameter space. Then 
a UMP unbiased test does not exist for testing $H : \theta_1 = a$, $\theta_2 = b$ against the 
alternatives $\theta_1 \ne a$ or $\theta_2 \ne b$. 13 

[The most powerful unbiased tests against the alternatives $\theta_1 \ne a$, $\theta_2 \ne b$ have 
acceptance regions $C_1 < T(x) < C_2$ and $K_1 < U(y) < K_2$ respectively. These 
tests are also unbiased against the wider class of alternatives $K : \theta_1 \ne a$ or $\theta_2 \ne b$ 
or both.] 

13 For counterexamples when the conditions of the problem are not satisfied, see 
Kallenberg et al. (1984). 

Problem 4.8 Let $(X, Y)$ be distributed according to the exponential family 

$$dP_{\theta_1,\theta_2}(x,y) = C(\theta_1,\theta_2)e^{\theta_1 x + \theta_2 y}\,d\mu(x,y).$$

The only unbiased test for testing $H : \theta_1 \le a$, $\theta_2 \le b$ against $K : \theta_1 > a$ or $\theta_2 > b$ 
or both is $\phi(x,y) \equiv \alpha$. 

[Take $a = b = 0$, and let $\beta(\theta_1,\theta_2)$ be the power function of any level-$\alpha$ test. 
Unbiasedness implies $\beta(0,\theta_2) = \alpha$ for $\theta_2 < 0$ and hence for all $\theta_2$, since $\beta(0,\theta_2)$ is 
an analytic function of $\theta_2$. For fixed $\theta_2 > 0$, $\beta(\theta_1,\theta_2)$ considered as a function of 
$\theta_1$ therefore has a minimum at $\theta_1 = 0$, so that $\partial\beta(\theta_1,\theta_2)/\partial\theta_1$ vanishes at $\theta_1 = 0$ 
for all positive $\theta_2$, and hence for all $\theta_2$. By considering alternatively positive and 
negative values of $\theta_2$ and using the fact that the partial derivatives of all orders 
of $\beta(\theta_1,\theta_2)$ with respect to $\theta_1$ are analytic, one finds that for each fixed $\theta_2$ these 
derivatives all vanish at $\theta_1 = 0$ and hence that the function $\beta$ must be a constant. 
Because of the completeness of $(X,Y)$, $\beta(\theta_1,\theta_2) \equiv \alpha$ implies $\phi(x,y) \equiv \alpha$.] 

Problem 4.9 For testing the hypothesis $H : \theta = \theta_0$ ($\theta_0$ an interior point of $\Omega$) 
in the one-parameter exponential family of Section 4.2, let $\mathcal{C}$ be the totality of 
tests satisfying (4.3) and (4.5) for some $-\infty \le C_1 \le C_2 \le \infty$ and $0 \le \gamma_1, \gamma_2 \le 1$. 

(i) $\mathcal{C}$ is complete in the sense that given any level-$\alpha$ test $\phi_0$ of $H$ there exists 
$\phi \in \mathcal{C}$ such that $\phi$ is uniformly at least as powerful as $\phi_0$. 

(ii) If $\phi_1, \phi_2 \in \mathcal{C}$, then neither of the two tests is uniformly more powerful than 
the other. 

(iii) Let the problem be considered as a two-decision problem, with decisions 
$d_0$ and $d_1$ corresponding to acceptance and rejection of $H$ and with loss 
function $L(\theta, d_i) = L_i(\theta)$, $i = 0, 1$. Then $\mathcal{C}$ is minimal essentially complete 
provided $L_1(\theta) < L_0(\theta)$ for all $\theta \ne \theta_0$. 

(iv) Extend the result of part (iii) to the hypothesis $H' : \theta_1 \le \theta \le \theta_2$. (For 
more general complete class results for exponential families and beyond, 
see Brown and Marden (1989).) 

[(i): Let the derivative of the power function of $\phi_0$ at $\theta_0$ be $\beta'_{\phi_0}(\theta_0) = \rho$. Then 
there exists $\phi \in \mathcal{C}$ such that $\beta'_{\phi}(\theta_0) = \rho$ and $\phi$ is UMP among all tests satisfying 
this condition. 

(ii): See the end of Section 3.7. 

(iii): See the proof of Theorem 3.4.2.] 


Section 4.3 

Problem 4.10 Let $X_1,\ldots,X_n$ be a sample from (i) the normal distribution 
$N(a\sigma, \sigma^2)$, with $a$ fixed and $0 < \sigma < \infty$; (ii) the uniform distribution 
$U(\theta - \frac{1}{2}, \theta + \frac{1}{2})$, $-\infty < \theta < \infty$; (iii) the uniform distribution $U(\theta_1, \theta_2)$, 
$-\infty < \theta_1 < \theta_2 < \infty$. For these three families of distributions the following 
statistics are sufficient: (i), $T = (\sum X_i, \sum X_i^2)$; (ii) and (iii), 
$T = (\min(X_1,\ldots,X_n), \max(X_1,\ldots,X_n))$. The family of distributions of $T$ is 
complete for case (iii), but for (i) and (ii) it is not complete or even boundedly 
complete. 

[(i): The distribution of $\sum X_i/\sqrt{\sum X_i^2}$ does not depend on $\sigma$.] 

Problem 4.11 Let $X_1,\ldots,X_m$ and $Y_1,\ldots,Y_n$ be samples from $N(\xi, \sigma^2)$ and 
$N(\xi, \tau^2)$. Then $T = (\sum X_i, \sum Y_j, \sum X_i^2, \sum Y_j^2)$, which in Example 4.3.3 was seen 
not to be complete, is also not boundedly complete. 

[Let $f(t)$ be 1 or $-1$ as $\bar y - \bar x$ is positive or not.] 

Problem 4.12 Counterexample. Let $X$ be a random variable taking on the 
values $-1, 0, 1, 2, \ldots$ with probabilities 

$$P_\theta\{X = -1\} = \theta; \qquad P_\theta\{X = x\} = (1-\theta)^2\theta^x, \quad x = 0, 1, \ldots.$$

Then $\mathcal{P} = \{P_\theta, 0 < \theta < 1\}$ is boundedly complete but not complete. [Girshick 
et al. (1946)] 
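
The failure of completeness can be checked numerically. The sketch below is an illustration, not taken from the text: it assumes the (unbounded) function $f(x) = x$ as a candidate and evaluates $E_\theta f(X)$ by truncating the series, with the truncation point `terms` chosen arbitrarily. It does not address bounded completeness, which requires a separate argument.

```python
def expectation(f, theta, terms=2000):
    """Truncated series for E_theta f(X) under the distribution of Problem 4.12."""
    total = f(-1) * theta  # contribution of the point x = -1
    for x in range(terms):
        total += f(x) * (1 - theta) ** 2 * theta**x
    return total


thetas = (0.1, 0.5, 0.9)
# Total probability (f = 1) and the expectation of f(x) = x for several theta.
masses = [expectation(lambda x: 1.0, th) for th in thetas]
means = [expectation(lambda x: float(x), th) for th in thetas]
```

If the values in `means` are all zero up to rounding, then $f(x) = x$ is a nonzero function with expectation zero for every $\theta$, which is what non-completeness requires.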

Problem 4.13 The completeness of the order statistics in Example 4.3.4 remains 
true if the family $\mathcal{F}$ is replaced by the family $\mathcal{F}_1$ of all continuous 
distributions. 

[Due to Fraser (1956). To show that for any integrable symmetric function $\phi$, 
$\int \phi(x_1,\ldots,x_n)\,dF(x_1)\cdots dF(x_n) = 0$ for all continuous $F$ implies $\phi = 0$ a.e., 
replace $F$ by $\alpha_1F_1 + \cdots + \alpha_nF_n$, where $0 < \alpha_i < 1$, $\sum\alpha_i = 1$. By considering 
the left side of the resulting identity as a polynomial in the $\alpha$’s one sees that 
$\int \phi(x_1,\ldots,x_n)\,dF_1(x_1)\cdots dF_n(x_n) = 0$ for all continuous $F_i$. This last equation 
remains valid if the $F_i$ are replaced by $I_{a_i}(x)F(x)$, where $I_{a_i}(x) = 1$ if $x \le a_i$ and 
$= 0$ otherwise. This implies that $\phi = 0$ except on a set which has measure 0 under 
$F \times \cdots \times F$ for all continuous $F$.] 

Problem 4.14 Determine whether $T$ is complete for each of the following 
situations: 

(i) $X_1,\ldots,X_n$ are independently distributed according to the uniform 
distribution over the integers $1, 2, \ldots, \theta$ and $T = \max(X_1,\ldots,X_n)$. 

(ii) $X$ takes on the values 1, 2, 3, 4 with probabilities $pq$, $p^2q$, $pq^2$, $1 - 2pq$ 
respectively, and $T = X$. 

Problem 4.15 Let $X$, $Y$ be independent binomial $b(p,m)$ and $b(p^2,n)$ respectively. 
Determine whether $(X, Y)$ is complete when 

(i) $m = n = 1$, 

(ii) $m = 2$, $n = 1$. 

Problem 4.16 Let $X_1,\ldots,X_n$ be a sample from the uniform distribution over 
the integers $1,\ldots,\theta$ and let $a$ be a positive integer. 

(i) The sufficient statistic $X_{(n)}$ is complete when the parameter space is 
$\Omega = \{\theta : \theta \le a\}$. 

(ii) Show that $X_{(n)}$ is not complete when $\Omega = \{\theta : \theta \ge a\}$, $a \ge 2$, and find a 
complete sufficient statistic in this case. 


Section 4.4 

Problem 4.17 Let $X_i$ ($i = 1, 2$) be independently distributed according to 
distributions from the exponential families (3.19) with $C$, $Q$, $T$, and $h$ replaced by 
$C_i$, $Q_i$, $T_i$, and $h_i$. Then there exists a UMP unbiased test of 

(i) $H : Q_2(\theta_2) - Q_1(\theta_1) \le c$ and hence in particular of $Q_2(\theta_2) \le Q_1(\theta_1)$; 

(ii) $H : Q_2(\theta_2) + Q_1(\theta_1) \le c$. 

Problem 4.18 Let $X$, $Y$, $Z$ be independent Poisson variables with means $\lambda$, $\mu$, 
$\nu$. Then there exists a UMP unbiased test of $H : \lambda\mu \le \nu^2$. 

Problem 4.19 Random sample size. Let $N$ be a random variable with a power-series 
distribution 

$$P(N = n) = \frac{a(n)\lambda^n}{C(\lambda)}, \quad n = 0, 1, \ldots \quad (\lambda > 0, \text{ unknown}).$$

When $N = n$, a sample $X_1,\ldots,X_n$ from the exponential family (3.19) is observed. 
On the basis of $(N, X_1,\ldots,X_N)$ there exists a UMP unbiased test of $H : Q(\theta) \le c$. 

Problem 4.20 Suppose $P\{I = 1\} = p = 1 - P\{I = 2\}$. Given $I = i$, 
$X \sim N(\theta, \sigma_i^2)$, where $\sigma_1^2 < \sigma_2^2$ are known. If $p = 1/2$, show that, based on the data 
$(X, I)$, there does not exist a UMP test of $\theta = 0$ vs $\theta > 0$. However, if $p$ is 
also unknown, show a UMPU test exists. [See Examples 10.20-21 in Romano and 
Siegel (1986).] 

Problem 4.21 Measurability of tests of Theorem 4.4.1. The function $\phi_3$ defined 
by (4.16) and (4.17) is jointly measurable in $u$ and $t$. 

[With $C_1 = v$ and $C_2 = w$, the determining equations for $v$, $w$, $\gamma_1$, $\gamma_2$ are 

$$F_t(v-) + [1 - F_t(w)] + \gamma_1[F_t(v) - F_t(v-)] + \gamma_2[F_t(w) - F_t(w-)] = \alpha \tag{4.26}$$

and 

$$G_t(v-) + [1 - G_t(w)] + \gamma_1[G_t(v) - G_t(v-)] + \gamma_2[G_t(w) - G_t(w-)] = \alpha \tag{4.27}$$

where 

$$F_t(u) = \int_{-\infty}^{u} C_t(\theta_1)e^{\theta_1 y}\,d\nu_t(y), \qquad G_t(u) = \int_{-\infty}^{u} C_t(\theta_2)e^{\theta_2 y}\,d\nu_t(y), \tag{4.28}$$

denote the conditional cumulative distribution function of $U$ given $t$ when $\theta = \theta_1$ 
and $\theta = \theta_2$ respectively. 

(1) For each $0 \le y \le \alpha$ let $v(y,t) = F_t^{-1}(y)$ and $w(y,t) = F_t^{-1}(1 - \alpha + y)$, where 
the inverse function is defined as in the proof of Theorem 4.4.1. Define $\gamma_1(y,t)$ 
and $\gamma_2(y,t)$ so that for $v = v(y,t)$ and $w = w(y,t)$, 

$$F_t(v-) + \gamma_1[F_t(v) - F_t(v-)] = y,$$

$$1 - F_t(w) + \gamma_2[F_t(w) - F_t(w-)] = \alpha - y.$$

(2) Let $H(y,t)$ denote the left-hand side of (4.27), with $v = v(y,t)$, etc. Then 
$H(0,t) > \alpha$ and $H(\alpha,t) < \alpha$. This follows by Theorem 3.4.1 from the fact that 
$v(0,t) = -\infty$ and $w(\alpha,t) = \infty$ (which shows the conditional tests corresponding 
to $y = 0$ and $y = \alpha$ to be one-sided), and that the left-hand side of (4.27) for any 
$y$ is the power of this conditional test. 

(3) For fixed $t$, the functions 

$$H_1(y,t) = G_t(v-) + \gamma_1[G_t(v) - G_t(v-)]$$

and 

$$H_2(y,t) = 1 - G_t(w) + \gamma_2[G_t(w) - G_t(w-)]$$

are continuous functions of $y$. This is a consequence of the fact, which follows from 
(4.28), that a.e. $P^T$ the discontinuities and flat stretches of $F_t$ and $G_t$ coincide. 

(4) The function $H(y,t)$ is jointly measurable in $y$ and $t$. This follows from the 
continuity of $H$ by an argument similar to the proof of measurability of $F_t(u)$ in 
the text. Define 

$$y(t) = \inf\{y : H(y,t) < \alpha\},$$

and let $v(t) = v[y(t),t]$, etc. Then (4.26) and (4.27) are satisfied for all $t$. The 
measurability of $v(t)$, $w(t)$, $\gamma_1(t)$, and $\gamma_2(t)$ defined in this manner will follow from 
measurability in $t$ of $y(t)$ and $F_t^{-1}[y(t)]$. This is a consequence of the relations, 
which hold for all real $c$, 

$$\{t : y(t) < c\} = \bigcup_{r<c}\{t : H(r,t) < \alpha\},$$

where $r$ indicates a rational, and 

$$\{t : F_t^{-1}[y(t)] \le c\} = \{t : y(t) - F_t(c) \le 0\}.]$$


Problem 4.22 Continuation. The function $\phi_4$ defined by (4.16), (4.18), and 
(4.19) is jointly measurable in $u$ and $t$. 

[The proof, which otherwise is essentially like that outlined in the preceding 
problem, requires the measurability in $z$ and $t$ of the integral 

$$g(z,t) = \int_{-\infty}^{z} u\,dF_t(u).$$

This integral is absolutely convergent for all $t$, since $F_t$ is a distribution belonging 
to an exponential family. For any $z < \infty$, $g(z,t) = \lim g_n(z,t)$, where 

gn(z,t) 




and the measurability of $g$ follows from that of the functions $g_n$. The inequalities 
corresponding to those obtained in step (2) of the preceding problem result from 
the property of the conditional one-sided tests established in Problem 3.45.] 


Problem 4.23 The UMP unbiased tests of the hypotheses $H_1,\ldots,H_4$ of Theorem 
4.4.1 are unique if attention is restricted to tests depending on $U$ and the 
$T$’s. 


Problem 4.24 The singly truncated normal (STN) distribution, indexed by 
parameters $\nu$ and $\lambda$, has support the positive real line with density 

$$p(x; \nu, \lambda) = C(\nu, \lambda)\exp(-\nu x - \lambda x^2),$$

where $C(\nu, \lambda)$ is a normalizing constant. Based on an i.i.d. sample, show there 
exists a UMPU test of the null hypothesis that the observations are exponential 
against the STN alternative, and describe the form of rejection region as explicitly 
as possible. [See Castillo and Puig (1999).] 


Section 4.5 

Problem 4.25 Negative binomial. Let $X$, $Y$ be independently distributed according 
to negative binomial distributions $Nb(p_1,m)$ and $Nb(p_2,n)$ respectively, 
and let $q_i = 1 - p_i$. 

(i) There exists a UMP unbiased test for testing $H : \theta = q_2/q_1 \le \theta_0$ and hence 
in particular $H' : p_1 \le p_2$. 

(ii) Determine the conditional distribution required for testing $H'$ when 
$m = n = 1$. 

Problem 4.26 Let $X$ and $Y$ be independently distributed with Poisson distributions 
$P(\lambda)$ and $P(\mu)$. Find the power of the UMP unbiased test of $H : \mu \le \lambda$, 
against the alternatives $\lambda = .1$, $\mu = .2$; $\lambda = 1$, $\mu = 2$; $\lambda = 10$, $\mu = 20$; $\lambda = .1$, 
$\mu = .4$; at level of significance $\alpha = .1$. 

[Since $T = X + Y$ has the Poisson distribution $P(\lambda + \mu)$, the power is 

$$\beta = \sum_{t=0}^{\infty}\beta(t)\,\frac{(\lambda+\mu)^t}{t!}\,e^{-(\lambda+\mu)},$$

where $\beta(t)$ is the power of the conditional test given $t$ against the alternative in 
question.] 
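
The power formula in the hint can be evaluated directly. The sketch below is an illustration under stated assumptions, not a computation from the text: it takes the conditional test given $T = t$ to be the one-sided randomized binomial test of $p \le 1/2$ based on $Y \sim b(p, t)$ with $p = \mu/(\lambda+\mu)$, and it truncates the Poisson sum at a point `t_max` chosen heuristically. The function names are hypothetical.

```python
from math import comb, exp, sqrt


def binom_pmf(k, t, p):
    """P{Y = k} for Y ~ b(p, t)."""
    return comb(t, k) * p**k * (1 - p) ** (t - k)


def conditional_power(t, p, alpha):
    """Power at p of the randomized level-alpha test of p <= 1/2 given T = t."""
    tail = 0.0  # P_{1/2}{Y > c}
    c = t
    while True:
        at_c = binom_pmf(c, t, 0.5)
        if tail + at_c > alpha:
            break
        tail += at_c
        c -= 1
    gamma = (alpha - tail) / at_c  # rejection probability when Y = c
    reject_sure = sum(binom_pmf(k, t, p) for k in range(c + 1, t + 1))
    return reject_sure + gamma * binom_pmf(c, t, p)


def unconditional_power(lam, mu, alpha):
    """Sum of beta(t) against the Poisson(lam + mu) weights, truncated at t_max."""
    s = lam + mu
    p = mu / s
    t_max = int(s + 10 * sqrt(s) + 30)  # heuristic truncation point
    weight = exp(-s)  # Poisson probability of t = 0
    total = 0.0
    for t in range(t_max + 1):
        total += weight * conditional_power(t, p, alpha)
        weight *= s / (t + 1)
    return total


powers = {(lam, mu): unconditional_power(lam, mu, 0.1)
          for lam, mu in [(0.1, 0.2), (1, 2), (10, 20), (0.1, 0.4)]}
```

On the boundary $\lambda = \mu$ the same function should return $\alpha$ up to truncation error, which gives a quick sanity check.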

Problem 4.27 Sequential comparison of two binomials. Consider two sequences 
of binomial trials with probabilities of success $p_1$ and $p_2$ respectively, and let 
$\rho = (p_2/q_2) \div (p_1/q_1)$. 

(i) If $\alpha < \beta$, no test with fixed numbers of trials $m$ and $n$ for testing $H : \rho = \rho_0$ 
can have power $\ge \beta$ against all alternatives with $\rho = \rho_1$. 

(ii) The following is a simple sequential sampling scheme leading to the desired 
result. Let the trials be performed in pairs of one of each kind, and restrict 
attention to those pairs in which one of the trials is a success and the other 
a failure. If experimentation is continued until $N$ such pairs have been 
observed, the number of pairs in which the successful trial belonged to the 
first series has the binomial distribution $b(\pi,N)$ with 
$\pi = p_1q_2/(p_1q_2 + p_2q_1) = 1/(1 + \rho)$. A test of arbitrarily high power against $\rho_1$ is 
therefore obtained by taking $N$ large enough. 

(iii) If $p_1/p_2 = \lambda$, use inverse binomial sampling to devise a test of $H : \lambda = \lambda_0$ 
against $K : \lambda > \lambda_0$. 

Problem 4.28 Positive dependence. Two random variables $(X, Y)$ with c.d.f. 
$F(x, y)$ are said to be positively quadrant dependent if $F(x, y) \ge F(x, \infty)F(\infty, y)$ 
for all $x$, $y$. 14 For the case that $(X, Y)$ takes on the four pairs of values (0, 0), (0, 1), 
(1, 0), (1, 1) with probabilities $p_{00}$, $p_{01}$, $p_{10}$, $p_{11}$, $(X, Y)$ are positively quadrant 
dependent if and only if the odds ratio $\Delta = p_{01}p_{10}/p_{00}p_{11} \le 1$. 

14 For a systematic discussion of this and other concepts of dependence, see Tong (1980, 
Chapter 5), Kotz, Wang and Hung (1990) and Yanagimoto (1990). 

Problem 4.29 Runs. Consider a sequence of $N$ dependent trials, and let $X_i$ be 
1 or 0 as the $i$th trial is a success or failure. Suppose that the sequence has the 
Markov property 15 

$$P\{X_i = 1 \mid x_1,\ldots,x_{i-1}\} = P\{X_i = 1 \mid x_{i-1}\}$$

and the property of stationarity according to which $P\{X_i = 1\}$ and 
$P\{X_i = 1 \mid x_{i-1}\}$ are independent of $i$. The distribution of the $X$’s is then specified by the 
probabilities 

$$p_1 = P\{X_i = 1 \mid x_{i-1} = 1\} \quad\text{and}\quad p_0 = P\{X_i = 1 \mid x_{i-1} = 0\}$$

and by the initial probabilities 

$$\pi_1 = P\{X_1 = 1\} \quad\text{and}\quad \pi_0 = 1 - \pi_1 = P\{X_1 = 0\}.$$

15 Statistical inference in these and more general Markov chains is discussed, for 
example, in Bhat and Miller (2002); they provide references at the end of Chapter 
5. 

(i) Stationarity implies that 

$$\pi_1 = \frac{p_0}{p_0 + q_1}, \qquad \pi_0 = \frac{q_1}{p_0 + q_1}.$$

(ii) A set of successive outcomes $x_i, x_{i+1},\ldots,x_{i+j}$ is said to form a run of zeros 
if $x_i = x_{i+1} = \cdots = x_{i+j} = 0$, and $x_{i-1} = 1$ or $i = 1$, and $x_{i+j+1} = 1$ 
or $i + j = N$. A run of ones is defined analogously. The probability of any 
particular sequence of outcomes $(x_1,\ldots,x_N)$ is 

$$\frac{1}{p_0 + q_1}\,p_0^{v}\,p_1^{n-v}\,q_1^{u}\,q_0^{m-u},$$

where $m$ and $n$ denote the numbers of zeros and ones, and $u$ and $v$ the 
numbers of runs of zeros and ones in the sequence. 
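
The closed form in (ii) can be compared with a direct computation along the chain. The following is a small brute-force check, offered as a sketch: the function names and the parameter values $p_0 = 0.3$, $p_1 = 0.6$, $N = 6$ are arbitrary choices for illustration, and the stationary initial probability $\pi_1 = p_0/(p_0 + q_1)$ from (i) is assumed.

```python
from itertools import product


def chain_probability(seq, p0, p1):
    """Probability of seq, multiplying the initial and transition probabilities."""
    q1 = 1 - p1
    pi1 = p0 / (p0 + q1)  # stationary initial probability from part (i)
    prob = pi1 if seq[0] == 1 else 1 - pi1
    for prev, cur in zip(seq, seq[1:]):
        p = p1 if prev == 1 else p0  # P{X_i = 1 | previous outcome}
        prob *= p if cur == 1 else 1 - p
    return prob


def runs_formula(seq, p0, p1):
    """Closed form of part (ii) in terms of m, n, u, v."""
    q0, q1 = 1 - p0, 1 - p1
    n = sum(seq)          # number of ones
    m = len(seq) - n      # number of zeros
    u = v = 0             # runs of zeros, runs of ones
    for i, x in enumerate(seq):
        if i == 0 or seq[i - 1] != x:
            if x == 0:
                u += 1
            else:
                v += 1
    return p0**v * p1 ** (n - v) * q1**u * q0 ** (m - u) / (p0 + q1)


p0, p1, N = 0.3, 0.6, 6
sequences = list(product((0, 1), repeat=N))
max_gap = max(abs(chain_probability(s, p0, p1) - runs_formula(s, p0, p1))
              for s in sequences)
total_mass = sum(runs_formula(s, p0, p1) for s in sequences)
```

A negligible `max_gap` and a `total_mass` of one indicate that the two expressions agree on every sequence of that length.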


Problem 4.30 Continuation. For testing the hypothesis of independence of the 
$X$’s, $H : p_0 = p_1$, against the alternatives $K : p_0 < p_1$, consider the run test, 
which rejects $H$ when the total number of runs $R = U + V$ is less than a constant 
$C(m)$ depending on the number $m$ of zeros in the sequence. When $R = C(m)$, 
the hypothesis is rejected with probability $\gamma(m)$, where $C$ and $\gamma$ are determined 
by 

$$P_H\{R < C(m) \mid m\} + \gamma(m)P_H\{R = C(m) \mid m\} = \alpha.$$

(i) Against any alternative of $K$ the most powerful similar test (which is at 
least as powerful as the most powerful unbiased test) coincides with the 
run test in that it rejects $H$ when $R < C(m)$. Only the supplementary 
rule for bringing the conditional probability of rejection (given $m$) up to $\alpha$ 
depends on the specific alternative under consideration. 

(ii) The run test is unbiased against the alternatives $K$. 

(iii) The conditional distribution of $R$ given $m$, when $H$ is true, is 16 

$$P\{R = 2r\} = \frac{2\binom{m-1}{r-1}\binom{n-1}{r-1}}{\binom{m+n}{m}},$$

$$P\{R = 2r+1\} = \frac{\binom{m-1}{r-1}\binom{n-1}{r} + \binom{m-1}{r}\binom{n-1}{r-1}}{\binom{m+n}{m}}.$$

16 This distribution is tabled by Swed and Eisenhart (1943) and Gibbons and 
Chakraborti (1992); it can be obtained from the hypergeometric distribution [Guenther 
(1978)]. For further discussion of the run test, see Lou (1996). 

[(i): Unbiasedness implies that the conditional probability of rejection given $m$ is 
$\alpha$ for all $m$. The most powerful conditional level-$\alpha$ test rejects $H$ for those sample 
sequences for which $\Delta(u,v) = (p_0/p_1)^v(q_1/q_0)^u$ is too large. Since $p_0 < p_1$ and 
$q_1 < q_0$ and since $|v - u|$ can only take on the values 0 and 1, it follows that 

$$\Delta(1,1) > \Delta(1,2),\ \Delta(2,1) > \Delta(2,2) > \Delta(2,3),\ \Delta(3,2) > \cdots.$$

Thus only the relation between $\Delta(i, i+1)$ and $\Delta(i+1, i)$ depends on the specific 
alternative, and this establishes the desired result. 

(ii): That the above conditional test is unbiased for each $m$ is seen by writing its 
power as 

$$\beta(p_0, p_1 \mid m) = (1 - \gamma)P\{R < C(m) \mid m\} + \gamma P\{R \le C(m) \mid m\},$$

since by (i) the rejection regions $R < C(m)$ and $R < C(m) + 1$ are both UMP at 
their respective conditional levels. 

(iii): When $H$ is true, the conditional probability given $m$ of any set of $m$ zeros 
and $n$ ones is $1/\binom{m+n}{m}$. The number of ways of dividing $n$ ones into $r$ groups is 
$\binom{n-1}{r-1}$, and that of dividing $m$ zeros into $r + 1$ groups is $\binom{m-1}{r}$. The conditional 
probability of getting $r + 1$ runs of zeros and $r$ runs of ones is therefore 

$$\frac{\binom{m-1}{r}\binom{n-1}{r-1}}{\binom{m+n}{m}}.$$

To complete the proof, note that the total number of runs is $2r + 1$ if and only 
if there are either $r + 1$ runs of zeros and $r$ runs of ones or $r$ runs of zeros and 
$r + 1$ runs of ones.] 
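
The null distribution in (iii) can be checked by enumeration for small $m$ and $n$. The sketch below is illustrative only: the function names are hypothetical, the values $m = 3$, $n = 4$ are arbitrary, and it assumes $m, n \ge 1$ so that the number of runs is at least 2.

```python
from itertools import combinations
from math import comb


def run_count_pmf(r_total, m, n):
    """P{R = r_total | m zeros, n ones} under H, for r_total >= 2 and m, n >= 1."""
    total = comb(m + n, m)
    if r_total % 2 == 0:
        r = r_total // 2
        ways = 2 * comb(m - 1, r - 1) * comb(n - 1, r - 1)
    else:
        r = (r_total - 1) // 2
        ways = (comb(m - 1, r - 1) * comb(n - 1, r)
                + comb(m - 1, r) * comb(n - 1, r - 1))
    return ways / total


def brute_force_pmf(m, n):
    """Relative frequency of each run count over all arrangements of m zeros, n ones."""
    counts = {}
    for zero_positions in combinations(range(m + n), m):
        zeros = set(zero_positions)
        seq = [0 if i in zeros else 1 for i in range(m + n)]
        runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
        counts[runs] = counts.get(runs, 0) + 1
    total = comb(m + n, m)
    return {r: c / total for r, c in counts.items()}


m, n = 3, 4
exact = {r: run_count_pmf(r, m, n) for r in range(2, m + n + 1)}
enumerated = brute_force_pmf(m, n)
```

Because all $\binom{m+n}{m}$ arrangements are equally likely under $H$, the enumerated frequencies should coincide with the formula.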

Problem 4.31 (i) Based on the conditional distribution of $X_2,\ldots,X_n$ given 
$X_1 = x_1$ in the model of Problem 4.29, there exists a UMP unbiased test 
of $H : p_0 = p_1$ against $p_0 > p_1$ for every $\alpha$. 

(ii) For the same testing problem, without conditioning on $X_1$ there exists a 
UMP unbiased test if the initial probability $\pi_1$ is assumed to be completely 
unknown instead of being given by the value stated in (i) of Problem 4.29. 

[The conditional distribution of $X_2,\ldots,X_n$ given $x_1$ is of the form 

C(xi;po,pi,qo,qi)Pi 1 Po° 9i 1 %° (yi,V 2 , Zi, z 2 ), 

where $y_1$ is the number of times a 1 follows a 1, $y_0$ the number of times a 1 follows 
a 0, and so on, in the sequence $x_1, X_2,\ldots,X_n$. [See Billingsley (1961, p. 14).] 

Problem 4.32 Rank-sum test. Let $Y_1,\ldots,Y_N$ be independently distributed 
according to the binomial distributions $b(p_i, n_i)$, $i = 1,\ldots,N$, where 

$$p_i = \frac{1}{1 + e^{-(\alpha+\beta x_i)}}.$$

This is the model frequently assumed in bioassay, where $x_i$ denotes the dose, or 
some function of the dose such as its logarithm, of a drug given to $n_i$ experimental 
subjects, and where $Y_i$ is the number among these subjects which respond to the 
drug at level $x_i$. Here the $x_i$ are known, and $\alpha$ and $\beta$ are unknown parameters. 

(i) The joint distribution of the $Y$’s constitutes an exponential family, and 
UMP unbiased tests exist for the four hypotheses of Theorem 4.4.1, concerning 
both $\alpha$ and $\beta$. 

(ii) Suppose in particular that $x_i = \Delta i$, where $\Delta$ is known, and that $n_i = 1$ 
for all $i$. Let $n$ be the number of successes in the $N$ trials, and let these 
successes occur in the $s_1$st, $s_2$nd, ..., $s_n$th trial, where $s_1 < s_2 < \cdots < s_n$. 
Then the UMP unbiased test for testing $H : \beta = 0$ against the alternatives 
$\beta > 0$ is carried out conditionally, given $n$, and rejects when the rank sum 
$\sum_{i=1}^{n} s_i$ is too large. 

(iii) Let $Y_1,\ldots,Y_M$ and $Z_1,\ldots,Z_N$ be two independent sets of experiments 
of the type described at the beginning of the problem, corresponding, say, 
to two different drugs. If $Y_i$ is distributed as $b(p_i, m_i)$ and $Z_j$ as $b(\pi_j, n_j)$, 
with 

$$p_i = \frac{1}{1 + e^{-(\alpha+\beta u_i)}}, \qquad \pi_j = \frac{1}{1 + e^{-(\gamma+\delta v_j)}},$$

then UMP unbiased tests exist for the four hypotheses concerning $\gamma - \alpha$ 
and $\delta - \beta$. 


Section 4.8 

Problem 4.33 In a 2 × 2 × 2 table with $m_1 = 3$, $n_1 = 4$; $m_2 = 4$, $n_2 = 4$; 
and $t_1 = 3$, $t'_1 = 4$, $t_2 = t'_2 = 4$, determine the probabilities 
$P(Y_1 + Y_2 \le k \mid X_i + Y_i = t_i, i = 1, 2)$ for $k = 0, 1, 2, 3$. 


Problem 4.34 In a 2 × 2 × K table with $\Delta_k = \Delta$, the test derived in the text 
as UMP unbiased for the case that the B and C margins are fixed has the same 
property when any two, one, or no margins are fixed. 


Problem 4.35 The UMP unbiased test of $H : \Delta = 1$ derived in Section 4.8 
for the case that the B- and C-margins are fixed (where the conditioning now 
extends to all random margins) is also UMP unbiased when 

(i) only one of the margins is fixed; 

(ii) the entries in the $4K$ cells are independent Poisson variables with means 
$\lambda_{ABC}, \ldots$, and $\Delta$ is replaced by the corresponding cross-ratio of the $\lambda$’s. 

Problem 4.36 Let $X_{ijkl}$ ($i, j, k = 0, 1$; $l = 1,\ldots,L$) denote the entries in a 
2 × 2 × 2 × L table with factors A, B, C, and D, and let 

P _ PaB c CDi ^ABCDi ^ABCDi ^ABCDi 

Pabcdi Pabcdi P'abcdi P > Abcd 1 

Then 

(i) under the assumption $\Gamma_l = \Gamma$ there exists a UMP unbiased test of the 
hypothesis $\Gamma \le \Gamma_0$ for any fixed $\Gamma_0$; 

(ii) when $L = 2$, there exists a UMP unbiased test of the hypothesis $\Gamma_1 = \Gamma_2$, 

in both cases regardless of whether 0, 1, 2 or 3 of the sets of margins are 
fixed. 

Section 4.9 

Problem 4.37 In the 2 × 2 table for matched pairs, show by formal computation 
that the conditional distribution of $Y$ given $X' + Y = d$ and $X = x$ is binomial 
with the indicated $p$. 

Problem 4.38 Consider the comparison of two success probabilities in (a) the 
two-binomial situation of Section 4.5 with $m = n$, and (b) the matched-pairs 
situation of Section 4.9. Suppose the matching is completely at random, that is, 
a random sample of $2n$ subjects, obtained from a population of size $N$ ($2n \le N$), 
is divided at random into $n$ pairs, and the two treatments $B$ and $B^c$ are assigned 
at random within each pair. 

(i) The UMP unbiased test for design (a) (Fisher’s exact test) is always more 
powerful than the UMP unbiased test for design (b) (McNemar’s test). 

(ii) Let $X_i$ (respectively $Y_i$) be 1 or 0 as the 1st (respectively 2nd) member of 
the $i$th pair is a success or failure. Then the correlation coefficient of $X_i$ 
and $Y_i$ can be positive or negative and tends to zero as $N \to \infty$. 

[(ii): Assume that the $k$th member of the population has probability of success 
$P_A^{(k)}$ under treatment $A$ and $P_{\tilde A}^{(k)}$ under $\tilde A$.] 

Problem 4.39 In the 2 × 2 table for matched pairs, in the notation of Section 
4.9, the correlation between the responses of the two members of a pair is 

$$\rho = \frac{p_{11} - \pi_1\pi_2}{\sqrt{\pi_1(1-\pi_1)\pi_2(1-\pi_2)}}.$$

For any given values of $\pi_1 < \pi_2$, the power of the one-sided McNemar test of 
$H : \pi_1 = \pi_2$ is an increasing function of $\rho$. 

[The conditional power of the test given $X' + Y = d$, $X = x$ is an increasing 
function of $p = p_{01}/(p_{01} + p_{10})$.] 

Note. The correlation $\rho$ increases with the effectiveness of the matching, and 
McNemar’s test under (b) of Problem 4.38 soon becomes more powerful than 
Fisher’s test under (a). For detailed numerical comparisons see Wacholder and 
Weinberg (1982) and the references given there. 


4.11 Notes 

The closely related properties of similarity (on the boundary) and unbiasedness 
are due to Neyman and Pearson (1933, 1936), who applied them to a variety of 
examples. It was pointed out by Neyman (1937) that similar tests could be obtained 
through the construction method now called Neyman structure. Theorem 
4.3.1 is due to Ghosh (1948) and Hoel (1948). The concepts of completeness and 
bounded completeness, and the application of the latter to Theorem 4.4.1, were 
developed by Lehmann and Scheffé (1950). 

The sign test, proposed by Arbuthnot (1710) to test that the probability of a 
male birth is 1/2, may be the first significance test in the literature. The exact 
test for independence in a 2 by 2 table is due to Fisher (1934). 


Unbiasedness: Applications to Normal 
Distributions; Confidence Intervals 


5.1 Statistics Independent of a Sufficient Statistic 

A general expression for the UMP unbiased tests of the hypotheses Hi : 9 < Go 
and H 4 , : 9 = do in the exponential family 

dPe,-e(x) = C(0, ft) exp d(j,(x) (5.1) 

was given in Theorem 4.4.1 of the preceding chapter. However, this turns out 
to be inconvenient in the applications to normal and certain other families of 
continuous distributions, with which we shall be concerned in the present chapter. 
In these applications, the tests can be given a more convenient form, in which 
they no longer appear as conditional tests in terms of U given t, but are expressed 
unconditionally in terms of a single test statistic. The following are three general 
methods of achieving this. 

(i) In many of the problems to be considered below, the UMP unbiased test 
4>o, is also UMP invariant, as will be shown in Chapter 6. From Theorem 6.5.3, 
it is then possible to conclude that <f >o is UMP unbiased. This approach, in which 
the latter property must be taken on faith during the discussion of the test in 
the present chapter, is the most economical of the three, and has the additional 
advantage that it derives the test instead of verifying a guessed solution as is the 
case with methods (ii) and (iii). 

(ii) The conditional descriptions (4.12), (4.14), and (4.16) can be replaced 
by equivalent unconditional ones, and it is then enough to find an unbiased test 
which has the indicated structure. This approach is discussed in Pratt (1962). 

(iii) Finally, it is often possible to show the equivalence of the test given by
Theorem 4.4.1 to a test suspected to be optimal, by means of Theorem 5.1.2
below. This is the course we shall follow here; the alternative derivation (i) will
be discussed in Chapter 6.

The reduction by method (iii) depends on the existence of a statistic $V = h(U,T)$,
which is independent of $T$ when $\theta = \theta_0$, and which for each fixed $t$ is
monotone in $U$ for $H_1$ and linear in $U$ for $H_4$. The critical function $\phi_1$ for testing
$H_1$ then satisfies

$\phi(v) = \begin{cases} 1 & \text{when } v > C_0, \\ \gamma_0 & \text{when } v = C_0, \\ 0 & \text{when } v < C_0, \end{cases}$ (5.2)

where $C_0$ and $\gamma_0$ are no longer dependent on $t$, and are determined by

$E_{\theta_0} \phi_1(V) = \alpha$. (5.3)

Similarly the test $\phi_4$ of $H_4$ reduces to

$\phi(v) = \begin{cases} 1 & \text{when } v < C_1 \text{ or } v > C_2, \\ \gamma_i & \text{when } v = C_i,\ i = 1, 2, \\ 0 & \text{when } C_1 < v < C_2, \end{cases}$ (5.4)

where the $C$'s and $\gamma$'s are determined by

$E_{\theta_0}[\phi_4(V)] = \alpha$ (5.5)

and

$E_{\theta_0}[V\phi_4(V)] = \alpha E_{\theta_0}(V)$. (5.6)

The corresponding reduction for the hypotheses $H_2 : \theta \le \theta_1$ or $\theta \ge \theta_2$ and
$H_3 : \theta_1 \le \theta \le \theta_2$ requires that $V$ be monotone in $U$ for each fixed $t$, and be
independent of $T$ when $\theta = \theta_1$ and $\theta = \theta_2$. The test $\phi_3$ is then given by (5.4)
with the $C$'s and $\gamma$'s determined by

$E_{\theta_1}\phi_3(V) = E_{\theta_2}\phi_3(V) = \alpha$. (5.7)

The test for $H_2$ as before has the critical function

$\phi_2(v; \alpha) = 1 - \phi_3(v; 1 - \alpha)$.

This is summarized in the following theorem. 

Theorem 5.1.1 Suppose that the distribution of $X$ is given by (5.1) and that
$V = h(U,T)$ is independent of $T$ when $\theta = \theta_0$. Then $\phi_1$ is UMP unbiased for
testing $H_1$ provided the function $h$ is increasing in $u$ for each $t$, and $\phi_4$ is UMP
unbiased for $H_4$ provided

$h(u,t) = a(t)u + b(t)$ with $a(t) > 0$.

The tests $\phi_2$ and $\phi_3$ are UMP unbiased for $H_2$ and $H_3$ if $V$ is independent of $T$
when $\theta = \theta_1$ and $\theta_2$, and if $h$ is increasing in $u$ for each $t$.

Proof. The test of $H_1$ defined by (4.12) and (4.13) is equivalent to that given
by (5.2), with the constants determined by

$P_{\theta_0}\{V > C_0(t) \mid t\} + \gamma_0(t) P_{\theta_0}\{V = C_0(t) \mid t\} = \alpha$.

By assumption, $V$ is independent of $T$ when $\theta = \theta_0$, and $C_0$ and $\gamma_0$ therefore do
not depend on $t$. This completes the proof for $H_1$, and that for $H_2$ and $H_3$ is
quite analogous.

The test of $H_4$ given in Section 4.4 is equivalent to that defined by (5.4) with
the constants $C_i$ and $\gamma_i$ determined by $E_{\theta_0}[\phi_4(V,t) \mid t] = \alpha$ and

$E_{\theta_0}\left[\phi_4(V,t)\,\frac{V - b(t)}{a(t)} \,\Big|\, t\right] = \alpha E_{\theta_0}\left[\frac{V - b(t)}{a(t)} \,\Big|\, t\right]$,

which reduces to

$E_{\theta_0}[V\phi_4(V,t) \mid t] = \alpha E_{\theta_0}[V \mid t]$.

Since $V$ is independent of $T$ for $\theta = \theta_0$, so are the $C$'s and $\gamma$'s as was to be
proved. ■

To prove the required independence of V and T in applications of Theorem 
5.1.1 to special cases, the standard methods of distribution theory are available: 
transformation of variables, characteristic functions, and the geometric method. 
Alternatively, for a given model $\{P_\vartheta, \vartheta \in \omega\}$, suppose $V$ is any statistic whose
distribution does not depend on $\vartheta$; such a statistic is said to be ancillary. Then, the
following theorem gives sufficient conditions to show $V$ and $T$ are independent.


Theorem 5.1.2 (Basu) Let the family of possible distributions of $X$ be
$\mathcal{P} = \{P_\vartheta, \vartheta \in \omega\}$, let $T$ be sufficient for $\mathcal{P}$, and suppose that the family $\mathcal{P}^T$ of
distributions of $T$ is boundedly complete. If $V$ is any ancillary statistic for $\mathcal{P}$, then $V$
is independent of $T$.

Proof. For any critical function $\phi$, the expectation $E_\vartheta \phi(V)$ is by assumption
independent of $\vartheta$. It therefore follows from Theorem 4.3.2 that $E[\phi(V) \mid t]$ is
constant (a.e. $\mathcal{P}^T$) for every critical function $\phi$, and hence that $V$ is independent
of $T$. ■


Corollary 5.1.1 Let $\mathcal{P}$ be the exponential family obtained from (5.1) by letting
$\theta$ have some fixed value. Then a statistic $V$ is independent of $T$ for all $\vartheta$ provided
the distribution of $V$ does not depend on $\vartheta$.

Proof. It follows from Theorem 4.3.1 that $\mathcal{P}^T$ is complete and hence boundedly
complete, and the preceding theorem is therefore applicable. ■

Example 5.1.1 Let $X_1, \ldots, X_n$ be independently, normally distributed with
mean $\xi$ and variance $\sigma^2$. Suppose first that $\sigma^2$ is fixed at $\sigma_0^2$. Then the assumptions
of Corollary 5.1.1 hold with $T = \bar{X} = \sum X_i/n$ and $\vartheta$ proportional to $\xi$. Let $f$ be
any function satisfying

$f(x_1 + c, \ldots, x_n + c) = f(x_1, \ldots, x_n)$ for all real $c$.

If

$V = f(X_1, \ldots, X_n)$,

then also $V = f(X_1 - \xi, \ldots, X_n - \xi)$. Since the variables $X_i - \xi$ are distributed as
$N(0, \sigma_0^2)$, which does not involve $\xi$, the distribution of $V$ does not depend on $\xi$. It
follows from Corollary 5.1.1 that any such statistic $V$, and therefore in particular
$V = \sum(X_i - \bar{X})^2$, is independent of $\bar{X}$. This is true for all $\sigma$.

Suppose, on the other hand, that $\xi$ is fixed at $\xi_0$. Then Corollary 5.1.1 applies
with $T = \sum(X_i - \xi_0)^2$ and $\vartheta = -1/(2\sigma^2)$. Let $f$ be any function such that

$f(cx_1, \ldots, cx_n) = f(x_1, \ldots, x_n)$ for all $c > 0$,

and let

$V = f(X_1 - \xi_0, \ldots, X_n - \xi_0)$.

Then $V$ is unchanged if each $X_i - \xi_0$ is replaced by $(X_i - \xi_0)/\sigma$, and since
these variables are normally distributed with zero mean and unit variance, the
distribution of $V$ does not depend on $\sigma$. It follows that all such statistics $V$, and
hence for example

$\frac{\bar{X} - \xi_0}{\sqrt{\sum(X_i - \bar{X})^2}}$ and $\frac{\bar{X} - \xi_0}{\sqrt{\sum(X_i - \xi_0)^2}}$,

are independent of $\sum(X_i - \xi_0)^2$. This, however, does not hold for all $\xi$, but only
when $\xi = \xi_0$. ■


Example 5.1.2 Let $U_1/\sigma_1^2$ and $U_2/\sigma_2^2$ be independently distributed according
to $\chi^2$-distributions with $f_1$ and $f_2$ degrees of freedom respectively, and suppose
that $\sigma_2^2/\sigma_1^2 = a$. The joint density of the $U$'s is then

$C u_1^{\frac{1}{2}f_1 - 1} u_2^{\frac{1}{2}f_2 - 1} \exp\left[-\frac{1}{2\sigma_2^2}(au_1 + u_2)\right]$

so that Corollary 5.1.1 is applicable with $T = aU_1 + U_2$ and $\vartheta = -1/(2\sigma_2^2)$. Since
the distribution of

$V = \frac{U_2}{U_1} = a\,\frac{U_2/\sigma_2^2}{U_1/\sigma_1^2}$

does not depend on $\sigma_2$, $V$ is independent of $aU_1 + U_2$. For the particular case
that $\sigma_2 = \sigma_1$, this proves the independence of $U_2/U_1$ and $U_1 + U_2$. ■


Example 5.1.3 Let $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$ be samples from normal
distributions $N(\xi,\sigma^2)$ and $N(\eta,\tau^2)$ respectively. Then $T = (\bar{X}, \sum X_i^2, \bar{Y}, \sum Y_i^2)$ is
sufficient for $(\xi, \sigma^2, \eta, \tau^2)$ and the family of distributions of $T$ is complete. Since

$V = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum(X_i - \bar{X})^2 \sum(Y_i - \bar{Y})^2}}$

is unchanged when $X_i$ and $Y_i$ are replaced by $(X_i - \xi)/\sigma$ and $(Y_i - \eta)/\tau$, the
distribution of $V$ does not depend on any of the parameters, and Theorem 5.1.2
shows $V$ to be independent of $T$. ■


5.2 Testing the Parameters of a Normal 
Distribution 

The four hypotheses $\sigma \le \sigma_0$, $\sigma \ge \sigma_0$, $\xi \le \xi_0$, $\xi \ge \xi_0$ concerning the variance $\sigma^2$
and mean $\xi$ of a normal distribution were discussed in Section 3.9, and it was
pointed out there that at the usual significance levels there exists a UMP test
only for the first one. We shall now show that the standard (likelihood-ratio)
tests are UMP unbiased for the above four hypotheses as well as for some of the
corresponding two-sided problems.

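For varying $\xi$ and $\sigma$, the densities

$(2\pi\sigma^2)^{-n/2} \exp\left(-\frac{n\xi^2}{2\sigma^2}\right) \exp\left(-\frac{1}{2\sigma^2}\sum x_i^2 + \frac{\xi}{\sigma^2}\sum x_i\right)$ (5.8)

of a sample $X_1, \ldots, X_n$ from $N(\xi,\sigma^2)$ constitute a two-parameter exponential
family, which coincides with (5.1) for

$\theta = -\frac{1}{2\sigma^2}$, $\vartheta = \frac{n\xi}{\sigma^2}$, $U(x) = \sum x_i^2$, $T(x) = \bar{x} = \frac{\sum x_i}{n}$.

By Theorem 4.4.1, there exists therefore a UMP unbiased test of the hypothesis
$\theta \ge \theta_0$, which for $\theta_0 = -1/(2\sigma_0^2)$ is equivalent to $H : \sigma \ge \sigma_0$. The rejection region
of this test can be obtained from (4.12), with the inequalities reversed because
the hypothesis is now $\theta \ge \theta_0$. In the present case this becomes

$\sum x_i^2 \le C_0(\bar{x})$

where

$P_{\sigma_0}\left\{\sum X_i^2 \le C_0(\bar{x}) \mid \bar{x}\right\} = \alpha$.

If this is written as

$\sum x_i^2 - n\bar{x}^2 \le C_0'(\bar{x})$

it follows from the independence of $\sum X_i^2 - n\bar{X}^2 = \sum(X_i - \bar{X})^2$ and $\bar{X}$ (Example
5.1.1) that $C_0'(\bar{x})$ does not depend on $\bar{x}$. The test therefore rejects when
$\sum(x_i - \bar{x})^2 \le C_0'$, or equivalently when

$\frac{\sum(x_i - \bar{x})^2}{\sigma_0^2} \le C_0$ (5.9)

with $C_0$ determined by $P_{\sigma_0}\{\sum(X_i - \bar{X})^2/\sigma_0^2 \le C_0\} = \alpha$. Since $\sum(X_i - \bar{X})^2/\sigma^2$
has a $\chi^2$-distribution with $n - 1$ degrees of freedom, the determining condition
for $C_0$ is

$\int_0^{C_0} \chi^2_{n-1}(y)\,dy = \alpha$ (5.10)

where $\chi^2_{n-1}$ denotes the density of a $\chi^2$ variable with $n - 1$ degrees of freedom.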

The same result can be obtained through Theorem 5.1.1. A statistic $V = h(U,T)$
of the kind required by the theorem, that is, independent of $\bar{X}$ for
$\sigma = \sigma_0$ and all $\xi$, is

$V = \sum(X_i - \bar{X})^2 = U - nT^2$.

This is in fact independent of $\bar{X}$ for all $\xi$ and $\sigma^2$. Since $h(u,t)$ is an increasing
function of $u$ for each $t$, it follows that the UMP unbiased test has a rejection
region of the form $V \le C_0'$.

This derivation also shows that the UMP unbiased rejection region for $H : \sigma \le \sigma_1$
or $\sigma \ge \sigma_2$ is

$C_1 < \sum(x_i - \bar{x})^2 < C_2$ (5.11)

where the $C$'s are given by

$\int_{C_1/\sigma_1^2}^{C_2/\sigma_1^2} \chi^2_{n-1}(y)\,dy = \int_{C_1/\sigma_2^2}^{C_2/\sigma_2^2} \chi^2_{n-1}(y)\,dy = \alpha$. (5.12)

Since $h(u,t)$ is linear in $u$, it is further seen that the UMP unbiased test of
$H : \sigma = \sigma_0$ has the acceptance region

$C_1' < \frac{\sum(x_i - \bar{x})^2}{\sigma_0^2} < C_2'$ (5.13)

with the constants determined by

$\int_{C_1'}^{C_2'} \chi^2_{n-1}(y)\,dy = \frac{1}{n-1}\int_{C_1'}^{C_2'} y\chi^2_{n-1}(y)\,dy = 1 - \alpha$. (5.14)
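As a numerical aside, the two conditions in (5.14) can be solved by nested bisection. The sketch below is only an illustration under stated assumptions: it uses the identity $y\chi^2_k(y) = k\chi^2_{k+2}(y)$, so that the second condition becomes a statement about the $\chi^2$ distribution with $n + 1$ degrees of freedom, and it evaluates the $\chi^2$ distribution function by the power series of the regularized incomplete gamma function. The function names and the search cap are hypothetical choices, not part of the text.

```python
import math


def chi2_cdf(y, df):
    """P(chi-square variable with df degrees of freedom <= y).

    Uses the power series of the regularized lower incomplete gamma
    function P(df/2, y/2)."""
    if y <= 0.0:
        return 0.0
    a, x = df / 2.0, y / 2.0
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-17 * total and k < 100000:
        k += 1
        term *= x / (a + k)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def find_root(g, lo, hi, iters=60):
    """Bisection for a function g with g(lo) < 0 <= g(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def unbiased_chi2_constants(df, alpha):
    """Solve both conditions of (5.14) for (C1', C2'), where df = n - 1."""
    target = 1.0 - alpha
    cap = 10.0 * df + 100.0  # crude upper search limit (an assumption)

    def upper_for(c1):
        # C2 making the first integral in (5.14) equal to 1 - alpha.
        base = chi2_cdf(c1, df)
        return find_root(lambda c2: chi2_cdf(c2, df) - base - target, c1, cap)

    def second_gap(c1):
        # Second condition, rewritten with df + 2 degrees of freedom.
        c2 = upper_for(c1)
        return chi2_cdf(c2, df + 2) - chi2_cdf(c1, df + 2) - target

    q_alpha = find_root(lambda c: chi2_cdf(c, df) - alpha, 0.0, cap)
    c1 = find_root(second_gap, 0.0, q_alpha)
    return c1, upper_for(c1)


c1, c2 = unbiased_chi2_constants(df=10, alpha=0.05)
```

For 10 degrees of freedom and $\alpha = 0.05$ the constants come out near 3.5 and 21.7, to the right of the equal-tails values; the digits should be checked against published tables before use.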


This is just the test obtained in Example 4.2.2 with $\sum(x_i - \bar{x})^2$ in place of
$\sum x_i^2$ and $n - 1$ degrees of freedom instead of $n$, as could have been foreseen.
Theorem 5.1.1 shows for this and the other hypotheses considered that the UMP
unbiased test depends only on $V$. Since the distributions of $V$ do not depend on
$\xi$, and constitute an exponential family in $\sigma$, the problems are thereby reduced
to the corresponding ones for a one-parameter exponential family, which were
solved previously.

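The power of the above tests can be obtained explicitly in terms of the $\chi^2$-
distribution. In the case of the one-sided test (5.9) for example, it is given by

$\beta(\sigma) = P_\sigma\left\{\frac{\sum(X_i - \bar{X})^2}{\sigma^2} \le \frac{C_0\sigma_0^2}{\sigma^2}\right\} = \int_0^{C_0\sigma_0^2/\sigma^2} \chi^2_{n-1}(y)\,dy$.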


The same method can be applied to the problems of testing the hypotheses
$\xi \le \xi_0$ against $\xi > \xi_0$ and $\xi = \xi_0$ against $\xi \ne \xi_0$. As is seen by transforming
to the variables $X_i - \xi_0$, there is no loss of generality in assuming that $\xi_0 = 0$.
It is convenient here to make the identification of (5.8) with (5.1) through the
correspondence

$\theta = \frac{n\xi}{\sigma^2}$, $\vartheta = -\frac{1}{2\sigma^2}$, $U(x) = \bar{x}$, $T(x) = \sum x_i^2$.

Theorem 4.4.1 then shows that UMP unbiased tests exist for the hypotheses $\theta \le 0$
and $\theta = 0$, which are equivalent to $\xi \le 0$ and $\xi = 0$. Since

$V = \frac{\bar{X}}{\sqrt{\sum(X_i - \bar{X})^2}} = \frac{U}{\sqrt{T - nU^2}}$

is independent of $T = \sum X_i^2$ when $\xi = 0$ (Example 5.1.1), it follows from Theorem
5.1.1 that the UMP unbiased rejection region for $H : \xi \le 0$ is $V \ge C_0'$ or
equivalently

$t(x) \ge C_0$, (5.15)

where

$t(x) = \frac{\sqrt{n}\,\bar{x}}{\sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}}$. (5.16)

In order to apply the theorem to $H' : \xi = 0$, let $W = \bar{X}/\sqrt{\sum X_i^2}$. This is
also independent of $\sum X_i^2$ when $\xi = 0$, and in addition is linear in $U = \bar{X}$. The
distribution of $W$ is symmetric about 0 when $\xi = 0$, and conditions (5.4), (5.5),
(5.6) with $W$ in place of $V$ are therefore satisfied for the rejection region $|w| \ge C'$
with $P_{\xi=0}\{|W| \ge C'\} = \alpha$. Since

$t(x) = \frac{\sqrt{(n-1)n}\,W(x)}{\sqrt{1 - nW^2(x)}}$,

the absolute value of $t(x)$ is an increasing function of $|W(x)|$, and the rejection
region is equivalent to

$|t(x)| \ge C$. (5.17)
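The relation between $t(x)$ and $W(x)$ is easy to confirm numerically. The following sketch, with hypothetical function names and a made-up sample, computes the statistic (5.16) directly and again through $W$, under the convention $\xi_0 = 0$ used above.

```python
import math


def t_statistic(x):
    """One-sample t statistic (5.16) for testing xi = 0."""
    n = len(x)
    mean = sum(x) / n
    ss = sum((xi - mean) ** 2 for xi in x)
    return math.sqrt(n) * mean / math.sqrt(ss / (n - 1))


def w_statistic(x):
    """W = mean / sqrt(sum of squares), which is linear in U = mean for fixed T."""
    n = len(x)
    return (sum(x) / n) / math.sqrt(sum(xi * xi for xi in x))


x = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative data only
n = len(x)
t = t_statistic(x)
w = w_statistic(x)
# t recovered from W through the displayed relation
t_from_w = math.sqrt(n * (n - 1)) * w / math.sqrt(1.0 - n * w * w)
```

For this sample both routes give $3\sqrt{2}$, since $\bar{x} = 3$ and $\sum(x_i - \bar{x})^2 = 10$.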


From (5.16) it is seen that $t(X)$ is the ratio of the two independent random
variables $\sqrt{n}\bar{X}/\sigma$ and $\sqrt{\sum(X_i - \bar{X})^2/[(n-1)\sigma^2]}$. The denominator is distributed as the
square root of a $\chi^2$-variable with $n - 1$ degrees of freedom, divided by $n - 1$; the
distribution of the numerator, when $\xi = 0$, is the normal distribution $N(0,1)$.
The distribution of such a ratio is Student's $t$-distribution with $n - 1$ degrees of
freedom, which has probability density (Problem 5.3)

$t_{n-1}(y) = \frac{1}{\sqrt{\pi(n-1)}}\,\frac{\Gamma(\frac{1}{2}n)}{\Gamma[\frac{1}{2}(n-1)]}\,\frac{1}{\left(1 + \frac{y^2}{n-1}\right)^{\frac{1}{2}n}}$. (5.18)

The distribution is symmetric about 0, and the constants $C_0$ and $C$ of the one-
and two-sided tests are determined by

$\int_{C_0}^{\infty} t_{n-1}(y)\,dy = \alpha$ and $\int_{C}^{\infty} t_{n-1}(y)\,dy = \frac{\alpha}{2}$. (5.19)
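The constants in (5.19) are ordinarily read from tables, but they can also be approximated directly from the density (5.18). The sketch below is one possible approach and not a recommended production method: it integrates the density by Simpson's rule, uses the symmetry about 0, and finds the critical value by bisection. The function names and the number of integration steps are assumptions made for illustration.

```python
import math


def t_density(y, df):
    """Student's t density (5.18), written with df = n - 1."""
    log_c = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
             - 0.5 * math.log(math.pi * df))
    return math.exp(log_c - 0.5 * (df + 1) * math.log1p(y * y / df))


def upper_tail(c, df, steps=2000):
    """P(T >= c) for c >= 0: one half minus the Simpson integral over [0, c]."""
    if c == 0.0:
        return 0.5
    h = c / steps  # steps must be even for Simpson's rule
    s = t_density(0.0, df) + t_density(c, df)
    for i in range(1, steps):
        s += (4.0 if i % 2 else 2.0) * t_density(i * h, df)
    return 0.5 - s * h / 3.0


def critical_value(tail_prob, df):
    """Smallest c >= 0 with P(T >= c) <= tail_prob, for tail_prob <= 1/2."""
    lo, hi = 0.0, 1.0
    while upper_tail(hi, df) > tail_prob:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if upper_tail(mid, df) > tail_prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha = 0.05
c_one_sided = critical_value(alpha, df=4)        # C_0 of (5.19) for n = 5
c_two_sided = critical_value(alpha / 2.0, df=4)  # C of (5.19) for n = 5
```

With $n = 5$ and $\alpha = 0.05$ the results agree with the tabled values 2.132 and 2.776 to the accuracy of the integration.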

For $\xi \ne 0$, the distribution of $t(X)$ is the so-called noncentral $t$-distribution,
which is derived in Problem 5.3. Some properties of the power function of the one-
and two-sided $t$-test are given in Problems 5.1, 5.2, and 5.4. We note here that the
distribution of $t(X)$, and therefore the power of the above tests, depends only on
the noncentrality parameter $\delta = \sqrt{n}\,\xi/\sigma$. This is seen from the expression of the
probability density given in Problem 5.3, but can also be shown by the following
direct argument. Suppose that $\xi'/\sigma' = \xi/\sigma \ne 0$, and denote the common value
of $\xi'/\xi$ and $\sigma'/\sigma$ by $c$, which is then also different from zero. If $X_i' = cX_i$ and the
$X_i$ are distributed as $N(\xi, \sigma^2)$, the variables $X_i'$ have distribution $N(\xi', \sigma'^2)$. Also
$t(X) = t(X')$, and hence $t(X')$ has the same distribution as $t(X)$, as was to be
proved. [Tables of the power of the $t$-test are discussed, for example, in Chapter
31, Section 7 of Johnson, Kotz and Balakrishnan (1995, Vol. 2).]

If $\xi_1$ denotes any alternative value to $\xi = 0$, the power $\beta(\xi_1, \sigma) = f(\delta)$ depends
on $\sigma$. As $\sigma \to \infty$, $\delta \to 0$, and

$\beta(\xi_1, \sigma) \to f(0) = \beta(0, \sigma) = \alpha$,

since $f$ is continuous by Theorem 2.7.1. Therefore, regardless of the sample size,
the probability of detecting the hypothesis to be false when $\xi \ge \xi_1 > 0$ cannot be
made $\ge \beta > \alpha$ for all $\sigma$. This is not surprising, since the distributions $N(0,\sigma^2)$
and $N(\xi_1,\sigma^2)$ become practically indistinguishable when $\sigma$ is sufficiently large.
To obtain a procedure with guaranteed power for $\xi \ge \xi_1$, the sample size must be
made to depend on $\sigma$. This can be achieved by a sequential procedure, with the
stopping rule depending on an estimate of $\sigma$, but not with a procedure of fixed
sample size. (See Problems 5.23 and 5.25.)

The tests of the more general hypotheses $\xi \le \xi_0$ and $\xi = \xi_0$ are reduced to
those above by transforming to the variables $X_i - \xi_0$. The rejection regions for
these hypotheses are given as before by (5.15), (5.17), and (5.19), but now with

$t(x) = \frac{\sqrt{n}(\bar{x} - \xi_0)}{\sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}}$.

It is seen from the representation of (5.8) as an exponential family with $\theta =
n\xi/\sigma^2$ that there exists a UMP unbiased test of the hypothesis $a \le \xi/\sigma^2 \le b$,
but the method does not apply to the more interesting hypothesis $a \le \xi \le b$;¹
nor is it applicable to the corresponding hypothesis for the mean expressed in
$\sigma$-units: $a \le \xi/\sigma \le b$, which will be discussed in Chapter 6. The dual equivalence
problem of testing $\xi/\sigma \notin [a, b]$ is treated in Brown, Casella and Hwang (1995),
Brown, Hwang, and Munk (1997) and Perlman and Wu (1999).

When testing the mean $\xi$ of a normal distribution, one may from extensive past
experience believe $\sigma$ to be essentially known. If in fact $\sigma$ is known to be equal to
$\sigma_0$, it follows from Problem 3.1 that there exists a UMP test $\phi_0$ of $H : \xi \le \xi_0$,
against $K : \xi > \xi_0$, which rejects when $(\bar{X} - \xi_0)/\sigma_0$ is sufficiently large, and this
test is then uniformly more powerful than the $t$-test (5.15). On the other hand,
if the assumption $\sigma = \sigma_0$ is in error the size of $\phi_0$ will differ from $\alpha$ and may
greatly exceed it. Whether to take such a risk depends on one's confidence in
the assumption and the gain resulting from the use of $\phi_0$ when $\sigma$ is equal to $\sigma_0$.
A measure of this gain is the deficiency $d$ of the $t$-test with respect to $\phi_0$, the
number of additional observations required by the $t$-test to match the power of $\phi_0$
when $\sigma = \sigma_0$. Except for very small $n$, $d$ is essentially independent of sample size
and for typical values of $\alpha$ is of the order of 1 to 3 additional observations. [For
details see Hodges and Lehmann (1970). Other approaches to such comparisons
are reviewed, for example, in Rothenberg (1984).]


5.3 Comparing the Means and Variances of Two 
Normal Distributions 

The problem of comparing the parameters of two normal distributions arises in 
the comparison of two treatments, products, etc., under conditions similar to 
those discussed at the beginning of Section 4.5. We consider first the comparison 
of two variances $\sigma^2$ and $\tau^2$, which occurs for example when one is concerned with
the variability of analyses made by two different laboratories or by two different
methods, and specifically the hypotheses $H : \tau^2/\sigma^2 \le \Delta_0$ and $H' : \tau^2/\sigma^2 = \Delta_0$.


¹ This problem is discussed in Section 3 of Hodges and Lehmann (1954).

Let $X = (X_1, \ldots, X_m)$ and $Y = (Y_1, \ldots, Y_n)$ be samples from the normal
distributions $N(\xi,\sigma^2)$ and $N(\eta,\tau^2)$ with joint density

$C(\xi, \eta, \sigma, \tau) \exp\left(-\frac{1}{2\sigma^2}\sum x_i^2 - \frac{1}{2\tau^2}\sum y_j^2 + \frac{m\xi}{\sigma^2}\bar{x} + \frac{n\eta}{\tau^2}\bar{y}\right)$.

This is an exponential family with the four parameters

$\theta = -\frac{1}{2\tau^2}$, $\vartheta_1 = -\frac{1}{2\sigma^2}$, $\vartheta_2 = \frac{n\eta}{\tau^2}$, $\vartheta_3 = \frac{m\xi}{\sigma^2}$

and the sufficient statistics

$U = \sum Y_j^2$, $T_1 = \sum X_i^2$, $T_2 = \bar{Y}$, $T_3 = \bar{X}$.

It can be expressed equivalently (see Lemma 4.4.1) in terms of the parameters

$\theta^* = -\frac{1}{2\tau^2} + \frac{1}{2\Delta_0\sigma^2}$, $\vartheta_i^* = \vartheta_i$ $(i = 1, 2, 3)$

and the statistics

$U^* = \sum Y_j^2$, $T_1^* = \sum X_i^2 + \frac{1}{\Delta_0}\sum Y_j^2$, $T_2^* = \bar{Y}$, $T_3^* = \bar{X}$.

The hypotheses $\theta^* \le 0$ and $\theta^* = 0$, which are equivalent to $H$ and $H'$ respectively,
therefore possess UMP unbiased tests by Theorem 4.4.1.

When $\tau^2 = \Delta_0\sigma^2$, the distribution of the statistic

$V = \frac{\sum(Y_j - \bar{Y})^2/\Delta_0}{\sum(X_i - \bar{X})^2} = \frac{\sum(Y_j - \bar{Y})^2/\tau^2}{\sum(X_i - \bar{X})^2/\sigma^2}$

does not depend on $\sigma$, $\xi$, or $\eta$, and it follows from Corollary 5.1.1 that $V$ is
independent of $(T_1^*, T_2^*, T_3^*)$. The UMP unbiased test of $H$ is therefore given by
(5.2) and (5.3), so that the rejection region can be written as

$\frac{\sum(Y_j - \bar{Y})^2/[\Delta_0(n-1)]}{\sum(X_i - \bar{X})^2/(m-1)} \ge C_0$. (5.20)

When $\tau^2 = \Delta_0\sigma^2$, the statistic on the left-hand side of (5.20) is the ratio of the
two independent $\chi^2$ variables $\sum(Y_j - \bar{Y})^2/\tau^2$ and $\sum(X_i - \bar{X})^2/\sigma^2$, each divided
by the number of its degrees of freedom. The distribution of such a ratio is the
$F$-distribution with $n - 1$ and $m - 1$ degrees of freedom, which has the density

$F_{n-1,m-1}(y) = \frac{\Gamma[\frac{1}{2}(m+n-2)]}{\Gamma[\frac{1}{2}(m-1)]\,\Gamma[\frac{1}{2}(n-1)]} \left(\frac{n-1}{m-1}\right)^{\frac{1}{2}(n-1)} \frac{y^{\frac{1}{2}(n-1)-1}}{\left(1 + \frac{n-1}{m-1}\,y\right)^{\frac{1}{2}(m+n-2)}}$. (5.21)

The constant $C_0$ of (5.20) is then determined by

$\int_{C_0}^{\infty} F_{n-1,m-1}(y)\,dy = \alpha$. (5.22)

In order to apply Theorem 5.1.1 to $H'$ let

$W = \frac{\sum(Y_j - \bar{Y})^2/\Delta_0}{\sum(X_i - \bar{X})^2 + (1/\Delta_0)\sum(Y_j - \bar{Y})^2}$.

This is also independent of $T^* = (T_1^*, T_2^*, T_3^*)$ when $\tau^2 = \Delta_0\sigma^2$, and is linear in
$U^*$. The UMP unbiased acceptance region of $H'$ is therefore

$C_1 \le W \le C_2$ (5.23)

with the constants determined by (5.5) and (5.6) where $V$ is replaced by $W$. On
dividing numerator and denominator of $W$ by $\sigma^2$ it is seen that for $\tau^2 = \Delta_0\sigma^2$,
the statistic $W$ is a ratio of the form $W_1/(W_1 + W_2)$, where $W_1$ and $W_2$ are
independent $\chi^2$ variables with $n - 1$ and $m - 1$ degrees of freedom respectively.
Equivalently, $W = Y/(1 + Y)$, where $Y = W_1/W_2$ and where $(m - 1)Y/(n - 1)$
has the distribution $F_{n-1,m-1}$. The distribution of $W$ is the beta-distribution²
with density

$B_{\frac{1}{2}(n-1),\frac{1}{2}(m-1)}(w) = \frac{\Gamma[\frac{1}{2}(m+n-2)]}{\Gamma[\frac{1}{2}(m-1)]\,\Gamma[\frac{1}{2}(n-1)]}\,w^{\frac{1}{2}(n-3)}(1 - w)^{\frac{1}{2}(m-3)}$, $0 < w < 1$. (5.24)


The conditions (5.5) and (5.6), by means of the relations

$E(W) = \frac{n-1}{m+n-2}$

and

$wB_{\frac{1}{2}(n-1),\frac{1}{2}(m-1)}(w) = \frac{n-1}{m+n-2}\,B_{\frac{1}{2}(n+1),\frac{1}{2}(m-1)}(w)$,

become

$\int_{C_1}^{C_2} B_{\frac{1}{2}(n-1),\frac{1}{2}(m-1)}(w)\,dw = \int_{C_1}^{C_2} B_{\frac{1}{2}(n+1),\frac{1}{2}(m-1)}(w)\,dw = 1 - \alpha$. (5.25)

The definition of $V$ shows that its distribution depends only on the ratio $\tau^2/\sigma^2$,
and so does the distribution of $W$. The power of the tests (5.20) and (5.23) is
therefore also a function only of the variable $\Delta = \tau^2/\sigma^2$; it can be expressed
explicitly in terms of the $F$-distribution, for example in the first case by

$\beta(\Delta) = P\left\{\frac{\sum(Y_j - \bar{Y})^2/[\tau^2(n-1)]}{\sum(X_i - \bar{X})^2/[\sigma^2(m-1)]} \ge \frac{C_0\Delta_0}{\Delta}\right\} = \int_{C_0\Delta_0/\Delta}^{\infty} F_{n-1,m-1}(y)\,dy$.
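To make the two statistics of this section concrete, the following sketch computes the left-hand side of (5.20) and the statistic $W$ of (5.23) for a small made-up data set, and checks the relation $W = Y/(1 + Y)$ with $(m-1)Y/(n-1)$ equal to the $F$ statistic. The function names, the data, and the value of $\Delta_0$ are illustrative assumptions only.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def variance_ratio_statistics(x, y, delta0=1.0):
    """Return (F statistic of (5.20), W of (5.23)) for samples x and y."""
    m, n = len(x), len(y)
    w1 = sum_sq_dev(y) / delta0  # numerator sum of squares, scaled by 1/Delta_0
    w2 = sum_sq_dev(x)           # denominator sum of squares
    f_stat = (w1 / (n - 1)) / (w2 / (m - 1))
    w_stat = w1 / (w1 + w2)
    return f_stat, w_stat


x = [1.0, 2.0, 3.0]        # illustrative X sample, m = 3
y = [2.0, 4.0, 6.0, 8.0]   # illustrative Y sample, n = 4
f_stat, w_stat = variance_ratio_statistics(x, y, delta0=2.0)

# Y = W1/W2 is (n - 1)/(m - 1) times the F statistic, and W = Y/(1 + Y).
ratio = (len(y) - 1) * f_stat / (len(x) - 1)
w_from_ratio = ratio / (1.0 + ratio)
```

Here the sums of squares are 2 and 20, so with $\Delta_0 = 2$ the $F$ statistic is $10/3$ and $W = 5/6$.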

The hypothesis of equality of the means $\xi$, $\eta$ of two normal distributions with
unknown variances $\sigma^2$ and $\tau^2$, the so-called Behrens-Fisher problem, is not accessible
by the present method. (See Example 4.3.3; for a discussion of this problem,
see Section 6.6, Section 11.3.1 and Example 13.5.4.) We shall therefore consider only


² The relationship $W = Y/(1 + Y)$ shows the $F$- and beta-distributions to be equivalent.
Tables of these distributions are discussed in Chapters 24 and 26 of Johnson, Kotz
and Balakrishnan (1995, Vol. 2). Critical values of $F$ are tabled by Mardia and Zemroch
(1978), who also provide algorithms for the associated computations.





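the simpler case in which the two variances are assumed to be equal. The joint
density of the $X$'s and $Y$'s is then

$C(\xi, \eta, \sigma) \exp\left[-\frac{1}{2\sigma^2}\left(\sum x_i^2 + \sum y_j^2\right) + \frac{\xi}{\sigma^2}\sum x_i + \frac{\eta}{\sigma^2}\sum y_j\right]$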

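which is an exponential family with parameters

$\theta = \frac{\eta}{\sigma^2}$, $\vartheta_1 = \frac{\xi}{\sigma^2}$, $\vartheta_2 = -\frac{1}{2\sigma^2}$

and the sufficient statistics

$U = \sum Y_j$, $T_1 = \sum X_i$, $T_2 = \sum X_i^2 + \sum Y_j^2$.

For testing the hypotheses

$H : \eta - \xi \le 0$ and $H' : \eta - \xi = 0$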


(5.26) 


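it is more convenient to represent the densities as an exponential family with the
parameters

$\theta^* = \frac{\eta - \xi}{\left(\frac{1}{m} + \frac{1}{n}\right)\sigma^2}$, $\vartheta_1^* = \frac{m\xi + n\eta}{(m + n)\sigma^2}$, $\vartheta_2^* = \vartheta_2$

and the sufficient statistics

$U^* = \bar{Y} - \bar{X}$, $T_1^* = m\bar{X} + n\bar{Y}$, $T_2^* = \sum X_i^2 + \sum Y_j^2$.

That this is possible is seen from the identity

$m\xi\bar{x} + n\eta\bar{y} = \frac{(\bar{y} - \bar{x})(\eta - \xi)}{\frac{1}{m} + \frac{1}{n}} + \frac{(m\bar{x} + n\bar{y})(m\xi + n\eta)}{m + n}$.

It follows from Theorem 4.4.1 that UMP unbiased tests exist for the hypotheses
$\theta^* \le 0$ and $\theta^* = 0$, and hence for $H$ and $H'$.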

When $\eta = \xi$, the distribution of

$V = \frac{\bar{Y} - \bar{X}}{\sqrt{\sum(X_i - \bar{X})^2 + \sum(Y_j - \bar{Y})^2}} = \frac{U^*}{\sqrt{T_2^* - \frac{1}{m+n}T_1^{*2} - \frac{mn}{m+n}U^{*2}}}$

does not depend on the common mean $\xi$ or on $\sigma$, as is seen by replacing $X_i$ with
$(X_i - \xi)/\sigma$ and $Y_j$ with $(Y_j - \xi)/\sigma$ in the expression for $V$, and $V$ is independent
of $(T_1^*, T_2^*)$. The rejection region of the UMP unbiased test of $H$ can therefore be
written as $V \ge C_0'$ or

$t(X, Y) \ge C_0$, (5.27)

where

$t(X, Y) = \frac{(\bar{Y} - \bar{X})\big/\sqrt{\frac{1}{m} + \frac{1}{n}}}{\sqrt{\left[\sum(X_i - \bar{X})^2 + \sum(Y_j - \bar{Y})^2\right]/(m + n - 2)}}$. (5.28)

The statistic $t(X, Y)$ is the ratio of the two independent variables

$\frac{\bar{Y} - \bar{X}}{\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)\sigma^2}}$ and $\sqrt{\frac{\sum(X_i - \bar{X})^2 + \sum(Y_j - \bar{Y})^2}{(m + n - 2)\sigma^2}}$.

The numerator is normally distributed with mean $(\eta - \xi)/(\sqrt{m^{-1} + n^{-1}}\,\sigma)$ and unit
variance; the square of the denominator as a $\chi^2$ variable with $m + n - 2$ degrees
of freedom, divided by $m + n - 2$. Hence $t(X, Y)$ has a noncentral $t$-distribution
with $m + n - 2$ degrees of freedom and noncentrality parameter

$\delta = \frac{\eta - \xi}{\sqrt{\frac{1}{m} + \frac{1}{n}}\,\sigma}$.

When in particular $\eta - \xi = 0$, the distribution of $t(X, Y)$ is Student's
$t$-distribution, and the constant $C_0$ is determined by

$\int_{C_0}^{\infty} t_{m+n-2}(y)\,dy = \alpha$. (5.29)
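As an illustration of (5.28), the following sketch computes the two-sample statistic for made-up data and confirms that it is a fixed multiple of $V$. The function name and the data are hypothetical; the common-variance assumption of this section is taken for granted.

```python
import math


def two_sample_t(x, y):
    """Two-sample t statistic (5.28), assuming a common variance."""
    m, n = len(x), len(y)
    xbar, ybar = sum(x) / m, sum(y) / n
    pooled_ss = (sum((a - xbar) ** 2 for a in x)
                 + sum((b - ybar) ** 2 for b in y))
    numerator = (ybar - xbar) / math.sqrt(1.0 / m + 1.0 / n)
    return numerator / math.sqrt(pooled_ss / (m + n - 2))


x = [1.0, 2.0, 3.0]        # illustrative X sample, m = 3
y = [2.0, 4.0, 6.0, 8.0]   # illustrative Y sample, n = 4
t_xy = two_sample_t(x, y)

# V = (ybar - xbar) / sqrt(pooled sum of squares); t is V times a constant
# depending only on m and n.
m, n = len(x), len(y)
v = (5.0 - 2.0) / math.sqrt(2.0 + 20.0)
t_from_v = v * math.sqrt(m + n - 2) / math.sqrt(1.0 / m + 1.0 / n)
```

The statistic is compared with $C_0$ from (5.29), taken from the $t$-distribution with $m + n - 2 = 5$ degrees of freedom in this example.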


As before, the assumptions required by Theorem 5.1.1 for $H'$ are not satisfied by
$V$ itself but by a function of $V$,

$W = \frac{\bar{Y} - \bar{X}}{\sqrt{\sum X_i^2 + \sum Y_j^2 - \frac{(\sum X_i + \sum Y_j)^2}{m + n}}}$,

which is related to $V$ through

$V = \frac{W}{\sqrt{1 - \frac{mn}{m+n}W^2}}$.

Since $W$ is a function of $V$, it is also independent of $(T_1^*, T_2^*)$ when $\eta = \xi$; in
addition it is a linear function of $U^*$ with coefficients dependent only on $T^*$.
The distribution of $W$ being symmetric about 0 when $\eta = \xi$, it follows, as in
the derivation of the corresponding rejection region (5.17) for the one-sample
problem, that the UMP unbiased test of $H'$ rejects when $|W|$ is too large, or
equivalently when

$|t(X, Y)| \ge C$. (5.30)

The constant $C$ is determined by

$\int_{C}^{\infty} t_{m+n-2}(y)\,dy = \frac{\alpha}{2}$.

The power of the tests (5.27) and (5.30) depends only on $(\eta - \xi)/\sigma$ and is given
in terms of the noncentral $t$-distribution. Its properties are analogous to those of
the one-sample $t$-test (Problems 5.1, 5.2, and 5.4).


5.4 Confidence Intervals and Families of Tests 

Confidence bounds for a parameter $\theta$ corresponding to a confidence level $1 - \alpha$
were defined in Section 3.5, for the case that the distribution of the random
variable $X$ depends only on $\theta$. When nuisance parameters $\vartheta$ are present the
defining condition for a lower confidence bound $\underline{\theta}$ becomes

$P_{\theta,\vartheta}\{\underline{\theta}(X) \le \theta\} \ge 1 - \alpha$ for all $\theta, \vartheta$. (5.31)

Similarly, confidence intervals for $\theta$ at confidence level $1 - \alpha$ are defined as a set
of random intervals with end points $\underline{\theta}(X)$, $\bar{\theta}(X)$ such that

$P_{\theta,\vartheta}\{\underline{\theta}(X) \le \theta \le \bar{\theta}(X)\} \ge 1 - \alpha$ for all $\theta, \vartheta$. (5.32)

The infimum over $(\theta, \vartheta)$ of the left-hand side of (5.31) and (5.32) is the confidence
coefficient associated with these statements.

As was already indicated in Chapter 3, confidence statements permit a dual
interpretation. Directly, they provide bounds for the unknown parameter $\theta$ and
thereby a solution to the problem of estimating $\theta$. The statement $\underline{\theta} \le \theta \le \bar{\theta}$ is not
as precise as a point estimate, but it has the advantage that the probability of it
being correct can be guaranteed to be at least $1 - \alpha$. Similarly, a lower confidence
bound can be thought of as an estimate $\underline{\theta}$ which overestimates the true parameter
value with probability $\le \alpha$. In particular for $\alpha = \frac{1}{2}$, if $\underline{\theta}$ satisfies

$P_{\theta,\vartheta}\{\underline{\theta} \le \theta\} = P_{\theta,\vartheta}\{\underline{\theta} \ge \theta\} = \frac{1}{2}$,

the estimate is as likely to underestimate as to overestimate and is then said to
be median unbiased. (See Problem 1.3, for the relation of this property to a more
general concept of unbiasedness.) For an exponential family given by (4.10) there
exists an estimator of $\theta$ which among all median unbiased estimators uniformly
minimizes the risk for any loss function $L(\theta, d)$ that is monotone in the sense of
the last paragraph of Section 3.5. A full treatment of this result, including some
probabilistic and measure-theoretic complications, is given by Pfanzagl (1979).

Alternatively, as was shown in Chapter 3, confidence statements can be viewed
as equivalent to a family of tests. The following is essentially a review of the discussion
of this relationship in Chapter 3, made slightly more specific by restricting
attention to the two-sided case. For each $\theta_0$, let $A(\theta_0)$ denote the acceptance region
of a level-$\alpha$ test (assumed for the moment to be nonrandomized) of the
hypothesis $H(\theta_0) : \theta = \theta_0$. If

$S(x) = \{\theta : x \in A(\theta)\}$

then

$\theta \in S(x)$ if and only if $x \in A(\theta)$, (5.33)

and hence

$P_{\theta,\vartheta}\{\theta \in S(X)\} \ge 1 - \alpha$ for all $\theta, \vartheta$. (5.34)

Thus any family of level-$\alpha$ acceptance regions, through the correspondence (5.33),
leads to a family of confidence sets at confidence level $1 - \alpha$.

Conversely, given any class of confidence sets $S(x)$ satisfying (5.34), let

$A(\theta) = \{x : \theta \in S(x)\}$. (5.35)

Then the sets $A(\theta_0)$ are level-$\alpha$ acceptance regions for testing the hypotheses
$H(\theta_0) : \theta = \theta_0$, and the confidence sets $S(x)$ show for each $\theta_0$ whether for the
particular $x$ observed the hypothesis $\theta = \theta_0$ is accepted or rejected at level $\alpha$.

Exactly the same arguments apply if the sets $A(\theta_0)$ are acceptance regions
for the hypotheses $\theta \le \theta_0$. As will be seen below, one- and two-sided tests typically,
although not always, lead to one-sided confidence bounds and to confidence
intervals respectively.


Example 5.4.1 (Normal mean) Confidence intervals for the mean $\xi$ of a normal
distribution with unknown variance can be obtained from the acceptance
regions $A(\xi_0)$ of the hypothesis $H : \xi = \xi_0$. These are given by

$\frac{|\sqrt{n}(\bar{x} - \xi_0)|}{\sqrt{\sum(x_i - \bar{x})^2/(n - 1)}} \le C$,

where $C$ is determined from the $t$-distribution so that the probability of this
inequality is $1 - \alpha$ when $\xi = \xi_0$. [See (5.17) and (5.19) of Section 5.2.] The set
$S(x)$ is then the set of $\xi$'s satisfying this inequality with $\xi = \xi_0$, that is, the
interval

$\bar{x} - \frac{C}{\sqrt{n}}\sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2} \le \xi \le \bar{x} + \frac{C}{\sqrt{n}}\sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}$. (5.36)

The class of these intervals therefore constitutes confidence intervals for $\xi$ with
confidence coefficient $1 - \alpha$.
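The correspondence (5.33) between acceptance regions and confidence sets can be checked numerically for this example. In the sketch below the critical value is passed in as an argument; the value 2.776 used in the illustration is the familiar tabled two-sided 5% point of the $t$-distribution with 4 degrees of freedom. The function names and the data are hypothetical.

```python
import math


def sample_mean_and_sd(x):
    """Sample mean and the standard deviation with divisor n - 1."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
    return mean, sd


def t_interval(x, c):
    """Confidence interval (5.36) for the mean, given the critical value c."""
    mean, sd = sample_mean_and_sd(x)
    half_width = c * sd / math.sqrt(len(x))
    return mean - half_width, mean + half_width


def accepts(x, xi0, c):
    """True if x lies in the acceptance region A(xi0) of the two-sided t-test."""
    mean, sd = sample_mean_and_sd(x)
    return abs(math.sqrt(len(x)) * (mean - xi0)) / sd <= c


x = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative data only
c = 2.776                      # tabled value for n - 1 = 4, alpha = 0.05
lower, upper = t_interval(x, c)

# Duality (5.33): xi0 is in S(x) exactly when x is in A(xi0).
grid = [k * 0.25 for k in range(-8, 32)]
duality_holds = all(accepts(x, xi0, c) == (lower <= xi0 <= upper) for xi0 in grid)
```

The grid is chosen so that no point falls on an end point of the interval, where rounding could make the two descriptions disagree.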

The length of the intervals (5.36) is proportional to $\sqrt{\sum(x_i - \bar{x})^2}$ and their
expected length to $\sigma$. For large $\sigma$, the intervals will therefore provide little information
concerning the unknown $\xi$. This is a consequence of the fact, which
led to similar difficulties for the corresponding testing problem, that two normal
distributions $N(\xi_0,\sigma^2)$ and $N(\xi_1,\sigma^2)$ with fixed difference of means become indistinguishable
as $\sigma$ tends to infinity. In order to obtain confidence intervals for
$\xi$ whose length does not tend to infinity with $\sigma$, it is necessary to determine the
number of observations sequentially so that it can be adjusted to $\sigma$. A sequential
procedure leading to confidence intervals of prescribed length is given in Problems
5.23 and 5.24.

However, even such a sequential procedure does not really dispose of the difficulty,
but only shifts the lack of control from the length of the interval to the
number of observations. As $\sigma \to \infty$, the number of observations required to obtain
confidence intervals of bounded length also tends to infinity. Actually, in
practice one will frequently have an idea of the order of magnitude of $\sigma$. With
a sample either of fixed size or obtained sequentially, it is then necessary to establish
a balance between the desired confidence $1 - \alpha$, the accuracy given by
the length $l$ of the interval, and the number of observations $n$ one is willing to
expend. In such an arrangement two of the three quantities $1 - \alpha$, $l$, and $n$ will
be fixed, while the third is a random variable whose distribution depends on $\sigma$,
so that it will be less well controlled than the others. If $1 - \alpha$ is taken as fixed,
the choice between a sequential scheme and one of fixed sample size thus depends
essentially on whether it is more important to control $l$ or $n$.

To obtain lower confidence limits for $\xi$, consider the acceptance regions

$\frac{\sqrt{n}(\bar{x} - \xi_0)}{\sqrt{\sum(x_i - \bar{x})^2/(n - 1)}} \le C_0$

for testing $\xi \le \xi_0$ against $\xi > \xi_0$. The sets $S(x)$ are then the one-sided intervals

$\bar{x} - \frac{C_0}{\sqrt{n}}\sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2} \le \xi$,

the left-hand sides of which therefore constitute the desired lower bounds $\underline{\xi}$. If
$\alpha = \frac{1}{2}$, the constant $C_0$ is 0; the resulting confidence bound $\underline{\xi} = \bar{X}$ is a median
unbiased estimate of $\xi$, and among all such estimates it uniformly maximizes

$P\{-\Delta_1 \le \xi - \underline{\xi} \le \Delta_2\}$ for all $\Delta_1, \Delta_2 \ge 0$.

(For a proof see Section 3.5.) ■


5.5 Unbiased Confidence Sets 

Confidence sets can be viewed as a family of tests of the hypotheses $\theta \in H(\theta')$
against alternatives $\theta \in K(\theta')$ for varying $\theta'$. A confidence level of $1 - \alpha$ then
simply expresses the fact that all the tests are to be at level $\alpha$, and the condition
therefore becomes

$P_{\theta,\vartheta}\{\theta' \in S(X)\} \ge 1 - \alpha$ for all $\theta \in H(\theta')$ and all $\vartheta$. (5.37)

In the case that $H(\theta')$ is the hypothesis $\theta = \theta'$ and $S(X)$ is the interval
$[\underline{\theta}(X), \bar{\theta}(X)]$, this agrees with (5.32). In the one-sided case in which $H(\theta')$ is
the hypothesis $\theta \le \theta'$ and $S(X) = \{\theta : \underline{\theta}(X) \le \theta\}$, the condition reduces to
$P_{\theta,\vartheta}\{\underline{\theta}(X) \le \theta'\} \ge 1 - \alpha$ for all $\theta' \ge \theta$, and this is seen to be equivalent to (5.31).
With this interpretation of confidence sets, the probabilities

$P_{\theta,\vartheta}\{\theta' \in S(X)\}$, $\theta \in K(\theta')$, (5.38)

are the probabilities of false acceptance of $H(\theta')$ (error of the second kind). The
smaller these probabilities are, the more desirable are the tests.

From the point of view of estimation, on the other hand, (5.38) is the probability
of covering the wrong value $\theta'$. With a controlled probability of covering
the true value, the confidence sets will be more informative the less likely they
are to cover false values of the parameter. In this sense the probabilities (5.38)
provide a measure of the accuracy of the confidence sets. A justification of (5.38)
in terms of loss functions was given for the one-sided case in Section 3.5.

In the presence of nuisance parameters, UMP tests usually do not exist, and
this implies the nonexistence of confidence sets that are uniformly most accurate
in the sense of minimizing (5.38) for all $\theta'$ such that $\theta \in K(\theta')$ and for all $\vartheta$.
This suggests restricting attention to confidence sets which in a suitable sense
are unbiased. In analogy with the corresponding definition for tests, a family of
confidence sets at confidence level $1-\alpha$ is said to be unbiased if

$$P_{\theta,\vartheta}\{\theta' \in S(X)\} \le 1-\alpha \tag{5.39}$$

for all $\theta'$ such that $\theta \in K(\theta')$ and for all $\vartheta$ and $\theta$,
so that the probability of covering these false values does not exceed the
confidence level.

In the two- and one-sided cases mentioned above, the condition (5.39) reduces
to

$$P_{\theta,\vartheta}\{\underline{\theta} \le \theta' \le \bar{\theta}\} \le 1-\alpha \quad \text{for all } \theta' \ne \theta \text{ and all } \vartheta$$

and

$$P_{\theta,\vartheta}\{\underline{\theta} \le \theta'\} \le 1-\alpha \quad \text{for all } \theta' < \theta \text{ and all } \vartheta.$$
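
The unbiasedness condition (5.39) can be checked numerically in the simplest setting. The following is only a sketch under assumptions not made in the text: a single observation $X$ distributed as $N(\theta, 1)$, no nuisance parameter, and the two-sided interval $X \pm c$. The function name `coverage_probability` is hypothetical.

```python
from statistics import NormalDist


def coverage_probability(theta, theta_prime, alpha=0.05):
    """Return P_theta{X - c <= theta' <= X + c} for X ~ N(theta, 1).

    Illustrative assumption: one observation with unit variance, and c the
    upper alpha/2 point of the standard normal distribution.
    """
    std = NormalDist()
    c = std.inv_cdf(1 - alpha / 2)
    shift = theta_prime - theta
    # theta' is covered exactly when theta' - c <= X <= theta' + c.
    return std.cdf(shift + c) - std.cdf(shift - c)
```

For $\theta' = \theta$ the function returns the confidence level $1-\alpha$; for every false value $\theta' \ne \theta$ it returns something smaller, which is what (5.39) requires.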

With this definition of unbiasedness, unbiased families of tests lead to unbiased
confidence sets and conversely. A family of confidence sets is uniformly most
accurate unbiased at confidence level $1-\alpha$ if it minimizes the probabilities

$$P_{\theta,\vartheta}\{\theta' \in S(X)\} \quad \text{for all } \theta' \text{ such that } \theta \in K(\theta') \text{ and for all } \vartheta \text{ and } \theta,$$

subject to (5.37) and (5.39). The confidence sets obtained on the basis of the
UMP unbiased tests of the present and preceding chapter are therefore uniformly
most accurate unbiased. This applies in particular to the confidence intervals
obtained in the preceding sections. Some further examples are the following.


Example 5.5.1 (Normal variance) If $X_1,\dots,X_n$ is a sample from $N(\xi,\sigma^2)$,
the UMP unbiased test of the hypothesis $\sigma = \sigma_0$ is given by the acceptance region
(5.13)

$$C_1' \le \frac{\sum (x_i - \bar{x})^2}{\sigma_0^2} \le C_2',$$

where $C_1'$ and $C_2'$ are determined by (5.14). The most accurate unbiased
confidence intervals for $\sigma^2$ are therefore

$$\frac{1}{C_2'} \sum (x_i - \bar{x})^2 \le \sigma^2 \le \frac{1}{C_1'} \sum (x_i - \bar{x})^2.$$

[Tables of $C_1'$ and $C_2'$ are provided by Tate and Klett (1959).] Similarly, from
(5.9) and (5.10) the most accurate unbiased upper confidence limits for $\sigma^2$ are

$$\sigma^2 \le \frac{1}{C_0} \sum (x_i - \bar{x})^2,$$

where

$$\int_{C_0}^{\infty} \chi^2_{n-1}(y)\,dy = 1-\alpha.$$

The corresponding lower confidence limits are uniformly most accurate (without
the restriction of unbiasedness) by Section 3.9. ■


Example 5.5.2 (Difference of means) Confidence intervals for the difference
$\Delta = \eta - \xi$ of the means of two normal distributions with common variance are
obtained from tests of the hypothesis $\eta - \xi = \Delta_0$. If $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$ are
distributed as $N(\xi,\sigma^2)$ and $N(\eta,\sigma^2)$ respectively, and if $Y_j' = Y_j - \Delta_0$, $\eta' = \eta - \Delta_0$,
the hypothesis can be expressed in terms of the variables $X_i$ and $Y_j'$ as $\eta' - \xi = 0$.
From (5.28) and (5.30) the UMP unbiased acceptance region is then seen to be

$$\frac{|\bar{y} - \bar{x} - \Delta_0| \Big/ \sqrt{\frac{1}{m} + \frac{1}{n}}}{\sqrt{\left[\sum (x_i - \bar{x})^2 + \sum (y_j - \bar{y})^2\right] / (m+n-2)}} \le C,$$

where $C$ is determined by the equation following (5.30). The most accurate
unbiased confidence intervals for $\eta - \xi$ are therefore

$$(\bar{y} - \bar{x}) - CS \le \eta - \xi \le (\bar{y} - \bar{x}) + CS \tag{5.40}$$

where

$$S^2 = \left(\frac{1}{m} + \frac{1}{n}\right) \frac{\sum (x_i - \bar{x})^2 + \sum (y_j - \bar{y})^2}{m+n-2}.$$

The one-sided intervals are obtained analogously. ■
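
The interval (5.40) is straightforward to compute once the critical value $C$ is available. The sketch below assumes that $C$ (a two-sided critical value of the $t$-distribution with $m+n-2$ degrees of freedom) is supplied by the caller, for instance from a table; the function name and argument names are hypothetical.

```python
from math import sqrt


def difference_of_means_interval(xs, ys, c):
    """Interval (5.40) for eta - xi, given the critical value c.

    c is assumed to come from the t-distribution with m + n - 2 degrees of
    freedom; it is not computed here.
    """
    m, n = len(xs), len(ys)
    xbar = sum(xs) / m
    ybar = sum(ys) / n
    pooled = (sum((x - xbar) ** 2 for x in xs)
              + sum((y - ybar) ** 2 for y in ys)) / (m + n - 2)
    s = sqrt((1 / m + 1 / n) * pooled)   # S of (5.40)
    center = ybar - xbar
    return center - c * s, center + c * s
```

For example, with `xs = [1, 2, 3]`, `ys = [2, 4, 6]` and `c = 2.776`, the interval is centered at $\bar{y} - \bar{x} = 2$.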


Example 5.5.3 (Ratio of variances) If $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$ are samples
from $N(\xi,\sigma^2)$ and $N(\eta,\tau^2)$, most accurate unbiased confidence intervals for
$\Delta = \tau^2/\sigma^2$ are derived from the acceptance region (5.23) as

$$\frac{1-C_2}{C_2}\,\frac{\sum (y_j - \bar{y})^2}{\sum (x_i - \bar{x})^2} \le \frac{\tau^2}{\sigma^2} \le \frac{1-C_1}{C_1}\,\frac{\sum (y_j - \bar{y})^2}{\sum (x_i - \bar{x})^2}, \tag{5.41}$$

where $C_1$ and $C_2$ are determined from (5.25).³ In the particular case that $m = n$,
the intervals take on the simpler form

$$\frac{1}{k}\,\frac{\sum (y_j - \bar{y})^2}{\sum (x_i - \bar{x})^2} \le \frac{\tau^2}{\sigma^2} \le k\,\frac{\sum (y_j - \bar{y})^2}{\sum (x_i - \bar{x})^2}, \tag{5.42}$$

where $k$ is determined from the $F$-distribution. Most accurate unbiased lower
confidence limits for the variance ratio are

$$\underline{\Delta} = \frac{1}{C_0}\,\frac{\sum (y_j - \bar{y})^2/(n-1)}{\sum (x_i - \bar{x})^2/(m-1)} \le \frac{\tau^2}{\sigma^2}$$

with $C_0$ given by (5.22). If in (5.22) $\alpha$ is taken to be $\frac{1}{2}$, this lower confidence
limit $\underline{\Delta}$ becomes a median unbiased estimate of $\tau^2/\sigma^2$. Among all such estimates
it uniformly maximizes

$$P\left\{-\Delta_1 \le \frac{\tau^2}{\sigma^2} - \underline{\Delta} \le \Delta_2\right\} \quad \text{for all } \Delta_1, \Delta_2 \ge 0.$$

(For a proof see Section 3.5.) ■


So far it has been assumed that the tests from which the confidence sets are
obtained are nonrandomized. The modifications that are necessary when this
assumption is not satisfied were discussed in Chapter 3. The randomized tests can
then be interpreted as being nonrandomized in the space of $X$ and an auxiliary
variable $V$ which is uniformly distributed on the unit interval. If in particular $X$
is integer-valued as in the binomial or Poisson case, the tests can be represented
in terms of the continuous variable $X + V$. In this way, most accurate unbiased
confidence intervals can be obtained, for example, for a binomial probability $p$
from the UMP unbiased tests of $H : p = p_0$ (Example 4.2.1). It is not clear a
priori that the resulting confidence sets for $p$ will necessarily be intervals. This
is, however, a consequence of the following lemma.


³A comparison of these limits with those obtained from the equal-tails test is given
by Scheffé (1942); some values of $C_1$ and $C_2$ are provided by Ramachandran (1958).

Lemma 5.5.1 Let $X$ be a real-valued random variable with probability density
$p_\theta(x)$ which has monotone likelihood ratio in $x$. Suppose that UMP unbiased tests
of the hypotheses $H(\theta_0) : \theta = \theta_0$ exist and are given by the acceptance regions

$$C_1(\theta_0) \le x \le C_2(\theta_0)$$

and that they are strictly unbiased. Then the functions $C_i(\theta)$ are strictly increasing
in $\theta$, and the most accurate unbiased confidence intervals for $\theta$ are

$$C_2^{-1}(x) \le \theta \le C_1^{-1}(x).$$

Proof. Let $\theta_0 < \theta_1$, and let $\beta_0(\theta)$ and $\beta_1(\theta)$ denote the power functions of the
above tests $\phi_0$ and $\phi_1$ for testing $\theta = \theta_0$ and $\theta = \theta_1$. It follows from the strict
unbiasedness of the tests that

$$E_{\theta_0}[\phi_1(X) - \phi_0(X)] = \beta_1(\theta_0) - \alpha > 0 > \alpha - \beta_0(\theta_1) = E_{\theta_1}[\phi_1(X) - \phi_0(X)].$$

Thus neither of the two intervals $[C_1(\theta_i), C_2(\theta_i)]$ $(i = 0, 1)$ contains the other, and
it is seen from Lemma 3.4.2(iii) that $C_i(\theta_0) < C_i(\theta_1)$ for $i = 1, 2$. The functions
$C_i$ therefore have inverses, and the inequalities defining the acceptance region for
$H(\theta)$ are equivalent to $C_2^{-1}(x) \le \theta \le C_1^{-1}(x)$, as was to be proved. ■

The situation is indicated in Figure 5.1. From the boundaries $x = C_1(\theta)$ and
$x = C_2(\theta)$ of the acceptance regions $A(\theta)$ one obtains for each fixed value of $x$
the confidence set $S(x)$ as the interval of $\theta$'s for which $C_1(\theta) \le x \le C_2(\theta)$.



Figure 5.1. 
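
The inversion described by the lemma can be sketched numerically. The code below assumes that the boundary functions $C_1$ and $C_2$ are available as strictly increasing Python callables and that the solutions lie in a known bracket `[lo, hi]`; the function names and the bisection approach are illustrative choices, not part of the text.

```python
def invert_increasing(c, x, lo, hi, tol=1e-10):
    """Solve c(theta) = x by bisection for a strictly increasing c on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if c(mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def confidence_interval(c1, c2, x, lo, hi):
    """Return (C2^{-1}(x), C1^{-1}(x)), the set of theta with c1(theta) <= x <= c2(theta)."""
    return invert_increasing(c2, x, lo, hi), invert_increasing(c1, x, lo, hi)
```

As a check, for a single observation from $N(\theta, 1)$ with $C_1(\theta) = \theta - 1.96$ and $C_2(\theta) = \theta + 1.96$, the call `confidence_interval(lambda t: t - 1.96, lambda t: t + 1.96, 2.14, -100, 100)` recovers the familiar interval $x \pm 1.96$.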

By Section 4.2, the conditions of the lemma are satisfied in particular for a 
one-parameter exponential family, provided the tests are nonrandomized. In cases 
such as that of binomial or Poisson distributions, where the family is exponential 
but X is integer-valued so that randomization is required, the intervals can be 
obtained by applying the lemma to the variable X + V instead of X, where V is 
independent of X and uniformly distributed over (0,1). 

Example 5.5.4 In the binomial case, a table of the (randomized) uniformly
most accurate unbiased confidence intervals is given by Blyth and Hutchinson
(1960). The best choice of nonrandomized intervals and some approximations
are discussed (and tables provided) by Blyth and Still (1983) and Blyth (1984).
Recent approximations and comparisons are provided by Agresti and Coull (1998)
and Brown, Cai and DasGupta (2001, 2002). A large sample approach will be
considered in Example 11.2.7. ■

In Lemma 5.5.1, the distribution of $X$ was assumed to depend only on $\theta$.
Consider now the exponential family (5.1) in which nuisance parameters are
present in addition to $\theta$. The UMP unbiased tests of $\theta = \theta_0$ are then performed
as conditional tests given $T = t$, and the confidence intervals for $\theta$ will as a
consequence also be obtained conditionally. If the conditional distributions are
continuous, the acceptance regions will be of the form

$$C_1(\theta; t) \le u \le C_2(\theta; t),$$

where for each $t$ the functions $C_i$ are increasing by Lemma 5.5.1. The confidence
intervals are then

$$C_2^{-1}(u; t) \le \theta \le C_1^{-1}(u; t).$$

If the conditional distributions are discrete, continuity can be obtained as before
through addition of a uniform variable.

Example 5.5.5 (Poisson ratio) Let $X$ and $Y$ be independent Poisson variables
with means $\lambda$ and $\mu$, and let $\rho = \mu/\lambda$. The conditional distribution of $Y$
given $X + Y = t$ is the binomial distribution $b(p, t)$ with

$$p = \frac{\mu}{\lambda + \mu} = \frac{\rho}{1+\rho}.$$

The UMP unbiased test $\phi(y, t)$ of the hypothesis $\rho = \rho_0$ is defined for each $t$ as
the UMP unbiased conditional test of the hypothesis $p = \rho_0/(1+\rho_0)$. If

$$\underline{p}(t) \le p \le \bar{p}(t)$$

are the associated most accurate unbiased confidence intervals for $p$ given $t$, it
follows that the most accurate unbiased confidence intervals for $\mu/\lambda$ are

$$\frac{\underline{p}(t)}{1 - \underline{p}(t)} \le \frac{\mu}{\lambda} \le \frac{\bar{p}(t)}{1 - \bar{p}(t)}.$$

The binomial tests which determine the functions $\underline{p}(t)$ and $\bar{p}(t)$ are discussed in
Example 4.2.1. ■
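
The last step of the example is a monotone change of scale from $p$ to $\rho = p/(1-p)$. A minimal sketch, assuming the binomial limits $\underline{p}(t)$ and $\bar{p}(t)$ have already been obtained by some other means (the function name is hypothetical):

```python
def poisson_ratio_interval(p_lower, p_upper):
    """Map confidence limits for p = rho / (1 + rho) to limits for rho = mu / lambda."""
    if not 0 <= p_lower <= p_upper < 1:
        raise ValueError("limits must satisfy 0 <= p_lower <= p_upper < 1")

    def odds(p):
        # p / (1 - p) is strictly increasing on [0, 1), so the order is kept.
        return p / (1 - p)

    return odds(p_lower), odds(p_upper)
```

For instance, binomial limits $0.2 \le p \le 0.5$ translate into $0.25 \le \mu/\lambda \le 1$.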


5.6 Regression 

The relation between two variables $X$ and $Y$ can be studied by drawing an
unrestricted sample and observing the two variables for each subject, obtaining
$n$ pairs of measurements $(X_1, Y_1),\dots,(X_n, Y_n)$ (see Section 5.13 and Problem
5.13). Alternatively, it is frequently possible to control one of the variables such
as the age of a subject, the temperature at which an experiment is performed,
or the strength of the treatment that is being applied. Observations $Y_1,\dots,Y_n$
of $Y$ can then be obtained at a number of predetermined levels $x_1,\dots,x_n$ of $x$.
Suppose that for fixed $x$ the distribution of $Y$ is normal with constant variance
$\sigma^2$ and a mean which is a function of $x$, the regression of $Y$ on $x$, and which is
assumed to be linear,⁴

$$E[Y \mid x] = \alpha + \beta x.$$

If we put $v_i = (x_i - \bar{x})/\sqrt{\sum (x_j - \bar{x})^2}$ and $\gamma + \delta v_i = \alpha + \beta x_i$, so that $\sum v_i = 0$,
$\sum v_i^2 = 1$, and

$$\alpha = \gamma - \delta\,\frac{\bar{x}}{\sqrt{\sum (x_j - \bar{x})^2}}, \qquad \beta = \frac{\delta}{\sqrt{\sum (x_j - \bar{x})^2}},$$

the joint density of $Y_1,\dots,Y_n$ is

$$\frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[-\frac{1}{2\sigma^2} \sum (y_i - \gamma - \delta v_i)^2\right].$$

These densities constitute an exponential family (5.1) with

$$U = \sum v_i Y_i, \qquad T_1 = \sum Y_i^2, \qquad T_2 = \sum Y_i,$$

$$\theta = \frac{\delta}{\sigma^2}, \qquad \vartheta_1 = -\frac{1}{2\sigma^2}, \qquad \vartheta_2 = \frac{\gamma}{\sigma^2}.$$

This representation implies the existence of UMP unbiased tests of the hypotheses
$a\gamma + b\delta = c$ where $a$, $b$, and $c$ are given constants, and therefore of most accurate
unbiased confidence intervals for the parameter

$$\rho = a\gamma + b\delta.$$

To obtain these confidence intervals explicitly, one requires the UMP unbiased
test of $H : \rho = \rho_0$, which is given by the acceptance region

$$\frac{\left|b \sum v_i Y_i + a\bar{Y} - \rho_0\right| \Big/ \sqrt{(a^2/n) + b^2}}{\sqrt{\left[\sum (Y_i - \bar{Y})^2 - \left(\sum v_i Y_i\right)^2\right] / (n-2)}} \le C, \tag{5.44}$$

where

$$\int_{-C}^{C} t_{n-2}(y)\,dy = 1-\alpha;$$

see Problem 5.33. The resulting confidence intervals for $\rho$ are centered at
$b \sum v_i Y_i + a\bar{Y}$, and their length is

$$L = 2C \sqrt{\left(\frac{a^2}{n} + b^2\right) \frac{\sum (Y_i - \bar{Y})^2 - \left(\sum v_i Y_i\right)^2}{n-2}}.$$

It follows from the transformations given in Problem 5.33 that
$\left[\sum (Y_i - \bar{Y})^2 - \left(\sum v_i Y_i\right)^2\right]/\sigma^2$ has a $\chi^2$-distribution with $n-2$ degrees of freedom and hence that
the expected length of the intervals is

$$E(L) = 2 C_n \sigma \sqrt{\frac{a^2}{n} + b^2}.$$

⁴The literature on regression is enormous and we treat the simplest model. Some texts
on the subject include Weisberg (1985), Atkinson and Riani (2000) and Chatterjee, Hadi
and Price (2000).

In particular applications, $a$ and $b$ typically are functions of the $x$'s. If these are
at the disposal of the experimenter and there is therefore some choice with respect
to $a$ and $b$, the expected length of $L$ is minimized by minimizing $(a^2/n) + b^2$.
Actually, it is not clear that the expected length is a good criterion for the
accuracy of confidence intervals, since short intervals are desirable when they
cover the true parameter value but not necessarily otherwise. However, the same
result holds for other criteria such as the expected value of $(\bar{\rho} - \rho)^2 + (\underline{\rho} - \rho)^2$ or
more generally of $f_1(|\bar{\rho} - \rho|) + f_2(|\underline{\rho} - \rho|)$, where $f_1$ and $f_2$ are increasing functions
of their arguments. (See Problem 5.33.) Furthermore, the same choice of $a$ and
$b$ also minimizes the probability of the intervals covering any false value of the
parameter. We shall therefore consider $(a^2/n) + b^2$ as an inverse measure of the
accuracy of the intervals.
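
The interval obtained by inverting (5.44) can be computed directly from the data. The sketch below assumes that the critical value $C$ of the $t$-distribution with $n-2$ degrees of freedom is supplied by the caller; the function name and the guard against a slightly negative residual sum of squares (from rounding) are choices made here, not part of the text.

```python
from math import sqrt


def regression_interval(xs, ys, a, b, c):
    """Confidence interval for rho = a*gamma + b*delta obtained from (5.44).

    c is the critical value C, assumed to be given (e.g. from a t table
    with n - 2 degrees of freedom).
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    v = [(x - xbar) / sqrt(sxx) for x in xs]        # sum v_i = 0, sum v_i^2 = 1
    u = sum(vi * yi for vi, yi in zip(v, ys))       # U = sum v_i Y_i
    resid = sum((y - ybar) ** 2 for y in ys) - u * u
    s2 = max(resid, 0.0) / (n - 2)                  # guard against rounding
    center = b * u + a * ybar
    half = c * sqrt((a * a / n + b * b) * s2)       # half of the length L
    return center - half, center + half
```

With `a = 0` and `b = 1 / sqrt(sum((x - xbar)**2))` the interval is centered at the least squares slope; with `a = 1` and `b = 0` it is centered at $\bar{Y}$, the estimate of $\gamma$.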


Example 5.6.1 (Slope of regression line) Confidence intervals for the slope
$\beta = \delta/\sqrt{\sum (x_j - \bar{x})^2}$ are obtained from the above intervals by letting $a = 0$
and $b = 1/\sqrt{\sum (x_j - \bar{x})^2}$. Here the accuracy increases with $\sum (x_j - \bar{x})^2$, and if
the $x_j$ must be chosen from an interval $[C_0, C_1]$, it is maximized by putting half
of the values at each end point. However, from a practical point of view, this
is frequently not a good design, since it permits no check of the linearity of the
regression. ■


Example 5.6.2 (Ordinate of regression line) Another parameter of interest
is the value $\alpha + \beta x_0$ to be expected from an observation $Y$ at $x = x_0$.
Since

$$\alpha + \beta x_0 = \gamma + \frac{\delta (x_0 - \bar{x})}{\sqrt{\sum (x_j - \bar{x})^2}},$$

the constants $a$ and $b$ are $a = 1$, $b = (x_0 - \bar{x})/\sqrt{\sum (x_j - \bar{x})^2}$. The maximum
accuracy is obtained by minimizing $|\bar{x} - x_0|$ and, if $\bar{x} = x_0$ cannot be achieved
exactly, also maximizing $\sum (x_j - \bar{x})^2$. ■


Example 5.6.3 (Intercept of regression line) Frequently it is of interest to
estimate the point $x$ at which $\alpha + \beta x$ has a preassigned value. One may for example
wish to find the dosage $x = -\alpha/\beta$ at which $E(Y \mid x) = 0$, or equivalently the
value $v = (x - \bar{x})/\sqrt{\sum (x_j - \bar{x})^2}$ at which $\gamma + \delta v = 0$. Most accurate unbiased
confidence sets for the solution $-\gamma/\delta$ of this equation can be obtained from the
UMP unbiased tests of the hypotheses $-\gamma/\delta = v_0$. The acceptance regions of
these tests are given by (5.44) with $a = 1$, $b = v_0$, and $\rho_0 = 0$, and the resulting
confidence sets for $v$ are the sets of values $v$ satisfying

$$v^2 \left[C^2 S^2 - \left(\sum v_i Y_i\right)^2\right] - 2 v \bar{Y} \left(\sum v_i Y_i\right) + \frac{1}{n}\left(C^2 S^2 - n \bar{Y}^2\right) \ge 0,$$

where $S^2 = \left[\sum (Y_i - \bar{Y})^2 - \left(\sum v_i Y_i\right)^2\right]/(n-2)$. If the associated quadratic equation
in $v$ has roots $\underline{v}$, $\bar{v}$, the confidence statement becomes

$$\underline{v} \le v \le \bar{v} \quad \text{when} \quad \frac{\left|\sum v_i Y_i\right|}{S} > C$$

and

$$v \le \underline{v} \ \text{ or } \ v \ge \bar{v} \quad \text{when} \quad \frac{\left|\sum v_i Y_i\right|}{S} < C.$$

The somewhat surprising possibility that the confidence sets may be the outside
of an interval actually is quite appropriate here. When the line $y = \gamma + \delta v$ is nearly
parallel to the $v$-axis, the intercept with the $v$-axis will be large in absolute value,
but its sign can be changed by a very small change in angle. There is the further
possibility that the discriminant of the quadratic polynomial is negative,

$$n \bar{Y}^2 + \left(\sum v_i Y_i\right)^2 < C^2 S^2,$$

in which case the associated quadratic equation has no solutions. This condition
implies that the leading coefficient of the quadratic polynomial is positive, so
that the confidence set in this case becomes the whole real axis. The fact that
the confidence sets are not necessarily finite intervals has led to the suggestion
that their use be restricted to the cases in which they do have this form. Such
usage will however affect the probability with which the sets cover the true value
and hence the validity of the reported confidence coefficient.⁵ ■
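
The three possible shapes of the confidence set in Example 5.6.3 can be made concrete by solving the quadratic inequality numerically. The sketch below takes the summary quantities $\sum v_i Y_i$, $\bar{Y}$, $S^2$, $n$ and $C$ as given inputs; the function name and the labels it returns are hypothetical.

```python
from math import sqrt


def intercept_confidence_set(u, ybar, s2, n, c):
    """Solve the quadratic inequality of Example 5.6.3 for v.

    u is sum(v_i * Y_i), ybar the mean of the Y_i, s2 the variance estimate
    S^2 and c the critical value C. The first entry of the returned tuple
    names the shape of the set: "interval", "outside" or "whole line".
    """
    lead = c * c * s2 - u * u                    # coefficient of v^2
    lin = -2.0 * ybar * u                        # coefficient of v
    const = (c * c * s2 - n * ybar * ybar) / n
    if lead == 0:
        raise ValueError("degenerate case: the inequality is linear in v")
    disc = lin * lin - 4.0 * lead * const
    if disc < 0:
        # No real roots; the leading coefficient is then positive.
        return ("whole line",)
    r1 = (-lin - sqrt(disc)) / (2.0 * lead)
    r2 = (-lin + sqrt(disc)) / (2.0 * lead)
    lo, hi = min(r1, r2), max(r1, r2)
    # The parabola opens downward when lead < 0, so it is >= 0 between the roots.
    return ("interval", lo, hi) if lead < 0 else ("outside", lo, hi)
```

With a clearly nonzero slope (for instance `u = 10, ybar = 2, s2 = 1, n = 10, c = 2`) the result is a finite interval around $-\bar{Y}/\sum v_i Y_i$; with `u = 1` it is the outside of an interval; and with `u = 0.5, ybar = 0.1` it is the whole line.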


5.7 Bayesian Confidence Sets 

The left side of the confidence statement (5.34) denotes the probability that
the random set $S(X)$ will contain the constant point $\theta$. The interpretation of
this probability statement, before $X$ is observed, is clear: it refers to the frequency
with which this random event will occur. Suppose for example that $X$ is
distributed as $N(\theta, 1)$, and consider the confidence interval

$$X - 1.96 < \theta < X + 1.96$$

corresponding to confidence coefficient $\gamma = .95$. Then the random interval
$(X - 1.96, X + 1.96)$ will contain $\theta$ with probability .95. Suppose now that $X$ is observed
to be 2.14. At this point, the earlier statement reduces to the inequality $0.18 < \theta < 4.10$,
which no longer involves any random element. Since the only unknown
quantity is $\theta$, it is tempting (but not justified) to say that $\theta$ lies between 0.18
and 4.10 with probability .95.

⁵A method for obtaining the size of this effect was developed by Neyman, and tables
have been computed on its basis by Fix. This work is reported by Bennett (1957).

To attach a meaningful probability to the event $\theta \in S(x)$ when $x$ is fixed
requires that $\theta$ be random. Inferences made under the assumption that the
parameter $\theta$ is itself a random (though unobservable) quantity with a known
distribution are called Bayesian, and the distribution $\Lambda$ of $\theta$ before any observations
are taken its prior distribution. After $X = x$ has been observed, inferences
concerning $\theta$ can be based on its conditional distribution given $x$, the posterior
distribution. In particular, any set $S(x)$ with the property

$$P[\theta \in S(x) \mid X = x] \ge \gamma \quad \text{for all } x$$

is a $100\gamma\%$ Bayesian confidence set or credible region for $\theta$. In the rest of this
section, the random variable with prior distribution $\Lambda$ will be denoted by $\Theta$, with
$\theta$ being the value taken on by $\Theta$ in the experiment at hand.


Example 5.7.1 (Normal mean) Suppose that $\Theta$ has a normal prior distribution
$N(\mu, b^2)$ and that given $\Theta = \theta$, the variables $X_1,\dots,X_n$ are independent
$N(\theta, \sigma^2)$, $\sigma$ known. Then the posterior distribution of $\Theta$ given $x_1,\dots,x_n$ is normal
with mean (Problem 5.34)

$$\eta_x = E[\Theta \mid x] = \frac{n\bar{x}/\sigma^2 + \mu/b^2}{n/\sigma^2 + 1/b^2}$$

and variance

$$\tau_x^2 = \operatorname{Var}[\Theta \mid x] = \frac{1}{n/\sigma^2 + 1/b^2}.$$

Since $[\Theta - \eta_x]/\tau_x$ then has a standard normal distribution, the interval $I(x)$ with
endpoints

$$\frac{n\bar{x}/\sigma^2 + \mu/b^2}{n/\sigma^2 + 1/b^2} \pm \frac{1.96}{\sqrt{n/\sigma^2 + 1/b^2}}$$

satisfies $P[\Theta \in I(x) \mid X = x] = .95$ and is thus a 95% credible region.
For $n = 1$, $\mu = 0$, $\sigma = 1$, the interval reduces to

$$\frac{x}{1 + \frac{1}{b^2}} \pm \frac{1.96}{\sqrt{1 + \frac{1}{b^2}}},$$

which for large $b$ is very close to the confidence interval for $\theta$ stated at the
beginning of the section. But now the statement that $\theta$ lies between these limits
with probability .95 is justified, since it is a probability statement concerning the
random variable $\Theta$.

The distribution $N(\mu, b^2)$ assigns higher probability to $\theta$-values near $\mu$ than
to those further away. Suppose instead that no information whatever about $\theta$ is
available, so that one wishes to model a state of complete ignorance. This could
be done by assigning a constant density to all values of $\theta$, that is, by assigning
to $\Theta$ the density $\pi(\theta) = c$, $-\infty < \theta < \infty$. Unfortunately, the resulting $\pi$ is not a
probability density, since $\int_{-\infty}^{\infty} \pi(\theta)\,d\theta = \infty$. However, if this fact is ignored and
the posterior distribution of $\Theta$ given $x$ is calculated in the usual way, it turns out
(Problem 5.35) that $\pi(\theta \mid x)$ is the density of a genuine probability distribution,
namely $N(\bar{x}, \sigma^2/n)$, the limit of the earlier posterior distribution as $b \to \infty$. The
improper (since it integrates to infinity), noninformative prior density $\pi(\theta) = c$
thus leads approximately to the same results as the normal prior $N(\mu, b^2)$ for
large $b$, and can be viewed as an approximation to the latter. ■
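
The posterior mean, the posterior variance and the resulting credible interval of Example 5.7.1 can be evaluated as follows. This is a sketch only; the function and argument names are hypothetical, and the normal quantile is taken from Python's standard library instead of being fixed at 1.96.

```python
from math import sqrt
from statistics import NormalDist


def normal_credible_interval(xbar, n, sigma, mu, b, gamma=0.95):
    """Credible interval for Theta with prior N(mu, b^2) and X_i ~ N(theta, sigma^2)."""
    precision = n / sigma ** 2 + 1 / b ** 2          # 1 / tau_x^2
    mean = (n * xbar / sigma ** 2 + mu / b ** 2) / precision   # eta_x
    sd = 1 / sqrt(precision)                          # tau_x
    z = NormalDist().inv_cdf((1 + gamma) / 2)         # about 1.96 for gamma = 0.95
    return mean - z * sd, mean + z * sd
```

With `n = 1`, `mu = 0`, `sigma = 1` and a very diffuse prior such as `b = 1e6`, the interval for `xbar = 2.14` is practically $(0.18, 4.10)$, the confidence interval from the beginning of the section; with `b = 1` it is pulled toward the prior mean and is shorter.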





Unlike confidence sets, Bayesian credible regions provide exactly the desired
kind of probability statement even after the observations are known. They do
so, however, at the cost of an additional assumption: that $\theta$ is random and
has a known prior distribution. Detailed accounts of the Bayesian approach, its
application to credible regions, and comparison of the two approaches can be
found in Berger (1985a) and Robert (1994). The following examples provide a
few illustrations and additional comments.


Example 5.7.2 Let $X$ be binomial $b(p, n)$, and suppose that the prior distribution
for $p$ is the beta distribution⁶ $B(a, b)$ with density $C p^{a-1}(1-p)^{b-1}$, $0 < p < 1$,
$0 < a, b$. Then the posterior distribution of $p$ given $X = x$ is the beta distribution
$B(a + x, b + n - x)$ (Problem 5.36). There are of course many sets $S(x)$ whose probability
under this distribution is equal to the prescribed coefficient $\gamma$. A choice
that is frequently recommended is the HPD (highest probability density) region,
defined by the requirement that the posterior density of $p$ given $x$ be $\ge k$.

With a beta prior, only the following possibilities can occur: for fixed $x$,

(a) $\pi(p \mid x)$ is decreasing,

(b) $\pi(p \mid x)$ is increasing,

(c) $\pi(p \mid x)$ is increasing in $(0, p_0)$ and decreasing in $(p_0, 1)$ for some $p_0$,

(d) $\pi(p \mid x)$ is U-shaped, i.e. decreasing in $(0, p_0)$ and increasing in $(p_0, 1)$ for
some $p_0$.

The HPD region then is of the form

(a) $p < K(x)$,

(b) $p > K(x)$,

(c) $K_1(x) < p < K_2(x)$,

(d) $p < K_1(x)$ or $p > K_2(x)$,

where the $K$'s are determined by the requirement that the posterior probability of
the region, given $x$, be $\gamma$; in cases (c) and (d) this condition must be supplemented
by

$$\pi[K_1(x) \mid x] = \pi[K_2(x) \mid x].$$

In general, if $\pi(\theta \mid x)$ denotes the posterior density of $\theta$, the HPD region is defined
by

$$\pi(\theta \mid x) \ge k$$

with $k$ determined by the size condition

$$P[\pi(\Theta \mid x) \ge k] = \gamma. \quad ■$$
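
For case (c) the HPD interval can be approximated numerically. The following sketch is not the method of the text: it assumes a unimodal beta posterior (both parameters greater than 1), discretizes $(0, 1)$ into equal cells, and collects cells in order of decreasing density until their total mass reaches $\gamma$. All names are hypothetical, and the result is accurate only up to the grid width.

```python
from math import exp, lgamma, log


def beta_pdf(p, a, b):
    """Density of the beta distribution B(a, b) at 0 < p < 1."""
    log_c = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_c + (a - 1) * log(p) + (b - 1) * log(1 - p))


def hpd_interval(a, b, gamma=0.95, grid=20000):
    """Approximate HPD interval for B(a, b), assuming a > 1 and b > 1 (case (c))."""
    step = 1.0 / grid
    mids = [(i + 0.5) * step for i in range(grid)]
    dens = [beta_pdf(p, a, b) for p in mids]
    # Visit the cells from highest to lowest density; for a unimodal density
    # the cells visited so far always form one contiguous interval.
    order = sorted(range(grid), key=lambda i: dens[i], reverse=True)
    mass, lo, hi = 0.0, 1.0, 0.0
    for i in order:
        mass += dens[i] * step                 # midpoint rule for the cell
        lo = min(lo, mids[i] - step / 2)
        hi = max(hi, mids[i] + step / 2)
        if mass >= gamma:
            break
    return lo, hi
```

For a binomial observation $x$ out of $n$ with prior $B(a, b)$ one would call `hpd_interval(a + x, b + n - x)`. At the returned end points the posterior density is approximately equal, in line with the supplementary condition $\pi[K_1(x) \mid x] = \pi[K_2(x) \mid x]$.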


⁶This is the so-called conjugate of the binomial distribution; for a more general
discussion of conjugate distributions, see Chapter 4 of TPE2 and Robert (1994), Section
3.2.

Example 5.7.3 (Two-parameter normal mean) Let $X_1,\dots,X_n$ be independent
$N(\xi, \sigma^2)$, and for the sake of simplicity suppose that $(\xi, \sigma)$ has the joint
improper prior density given by

$$\pi(\xi, \sigma)\,d\xi\,d\sigma = d\xi\,\frac{1}{\sigma}\,d\sigma \quad \text{for all } -\infty < \xi < \infty,\ 0 < \sigma,$$

which is frequently used to model absence of information concerning the parameters.
Then the joint posterior density of $(\xi, \sigma)$ given $x = (x_1,\dots,x_n)$ is of the
form

$$\pi(\xi, \sigma \mid x)\,d\xi\,d\sigma = C(x)\,\frac{1}{\sigma^{n+1}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (\xi - x_i)^2\right) d\xi\,d\sigma.$$

Determination of a credible region for $\xi$ requires the marginal posterior density
of $\xi$ given $x$, which is obtained by integrating the joint posterior density with
respect to $\sigma$. These densities depend only on the sufficient statistics $\bar{x}$ and
$S^2 = \sum (x_i - \bar{x})^2$, and the posterior density of $\xi$ is of the form (Problem 5.37)

$$A(x) \left[\frac{1}{1 + \dfrac{n(\xi - \bar{x})^2}{S^2}}\right]^{n/2}.$$

Here $\bar{x}$ and $S$ enter only as location and scale parameters, and the linear function

$$t = \frac{\sqrt{n}\,(\xi - \bar{x})}{S/\sqrt{n-1}}$$

of $\xi$ has the $t$-distribution with $n-1$ degrees of freedom. Since this agrees with the
distribution of $t$ for fixed $\xi$ and $\sigma$ given in Section 5.2, the credible $100(1-\alpha)\%$
region

$$\frac{\left|\sqrt{n}\,(\xi - \bar{x})\right|}{S/\sqrt{n-1}} \le C$$

is formally identical with the confidence intervals (5.36). However, they are derived
under different assumptions, and their interpretation differs accordingly.
The relationship between Bayesian intervals and classical intervals is further
explored in Nicolaou (1993) and Severini (1993). ■


Example 5.7.4 (Two-parameter normal: estimating $\sigma$) Under the assumptions
of the preceding example, credible regions for $\sigma$ are based on the posterior
distribution of $\sigma$ given $x$, obtained by integrating the joint posterior density of
$(\xi, \sigma)$ with respect to $\xi$. Using the fact that $\sum (\xi - x_i)^2 = n(\xi - \bar{x})^2 + \sum (x_i - \bar{x})^2$,
it is seen (Problem 5.38) that given $x$, the conditional (posterior) distribution
of $\sum (x_i - \bar{x})^2/\sigma^2$ is $\chi^2$ with $n-1$ degrees of freedom. As in the case of the
mean, this agrees with the sampling distribution of the same quantity when $\sigma$
is a (constant) parameter, given in Section 5.2. (The agreement in both cases of
two distributions derived under such different assumptions is a consequence of
the particular choice of the prior distribution and the fact that it is invariant in
the sense of TPE2, Section 4.4.) A change of variables now gives the posterior
density of $\sigma$ and shows that $\pi(\sigma \mid x)$ is of the form (c) of Example 5.7.2, so that
the HPD region is of the form $K_1(x) < \sigma < K_2(x)$ with $0 < K_1(x) < K_2(x) < \infty$.

Suppose that a credible region is required, not for $\sigma$, but for $\sigma^r$ for some $r > 0$.
For consistency, this should then be given by $[K_1(x)]^r < \sigma^r < [K_2(x)]^r$, but this
is not the case, since the relative height of the density of a random variable at two
points is not invariant under monotone transformations of the variable. In fact,
in the present case, the HPD region for $\sigma^r$ will become one-sided for sufficiently
large $r$ although it is two-sided for $r = 1$ (Problem 5.38). ■


Such inconsistencies do not occur if the HPD region is replaced by the equal-tails
interval $(C_1(x), C_2(x))$ for which $P[\Theta < C_1(x) \mid X = x] = P[\Theta > C_2(x) \mid X = x] = (1-\gamma)/2$.⁷
More generally inconsistencies under transformations of
$\Theta$ are avoided when the posterior distribution of $\Theta$ is summarized by a number
of its percentiles corresponding to the standard confidence points mentioned in
Section 3.5. Such a set is a compromise between providing the complete posterior
distribution and providing a single interval corresponding to only two percentiles.

Both the confidence and the Bayes approach present difficulties: the first, the 
problem of postdata interpretation; the second, the choice of a prior distribution 
and the interpretation of the posterior coverage probabilities if there is no clear 
basis for this choice. It is therefore not surprising that efforts have been made to 
find an approach without these drawbacks. The first such attempt, from which 
most later ones derive, is due to Fisher [1930; for his final account see Fisher 
(1973)]. 

To discuss Fisher's concept of fiducial probability, consider once more the example
at the beginning of the section, in which $X$ is distributed as $N(\theta, 1)$. Since
then $X - \theta$ is distributed as $N(0, 1)$, so is $\theta - X$, and hence

$$P(\theta - X \le y) = \Phi(y) \quad \text{for all } y.$$

For fixed $X = x$, this is the formal statement that a random variable $\theta$ has distribution
$N(x, 1)$. Without assuming $\theta$ to be random, Fisher calls $N(x, 1)$ the
fiducial distribution of $\theta$. Since this distribution is to embody the information
about $\theta$ provided by the data, it should be unique, and Fisher imposes conditions
which he hopes will ensure uniqueness. This leads to some technical difficulties,
but more basic is the question of how to interpret fiducial probability. In a series
of independent repetitions of the experiment with arbitrarily varying $\theta_i$, the quantities
$\theta_1 - X_1, \theta_2 - X_2, \dots$ will constitute a sequence of independent standard
normal variables. From this fact, Fisher attempts to derive the fiducial distribution
$N(x, 1)$ of $\theta$ as a frequency distribution with respect to an appropriate
reference set. However, this argument is difficult to follow and unconvincing. For
summaries of the fiducial literature and of later related developments by Dempster,
Fraser, and others, see Buehler (1983), Edwards (1983), Seidenfeld (1992),
Zabell (1992), Barnard (1995, 1996) and Fraser (1996).

Fisher’s effort to define a suitable frame of reference led him to the important 
concept of relevant subsets, which will be discussed in Chapter 10. 

To appreciate the differences between the frequentist, Bayesian and Fisherian 
points of view, see Lehmann (1993), Robert (1994), Berger, Boukai and Wang 
(1997), Berger (2003) and Bayarri and Berger (2004). 


⁷They also do not occur when the posterior distribution of $\Theta$ is discrete.

5.8 Permutation Tests


For the comparison of a treatment with a control situation in which no treatment
is given, it was shown in Section 5.3 that the one-sided $t$-test is UMP unbiased
for testing $H : \eta = \xi$ against $\eta - \xi = \Delta > 0$ when the measurements $X_1,\dots,X_m$
and $Y_1,\dots,Y_n$ are samples from normal populations $N(\xi, \sigma^2)$ and $N(\eta, \sigma^2)$. It
will be shown in Section 11.3 that the level of this test is (asymptotically) robust
against nonnormality; that is, except for small $m$ or $n$ the level of the test
is approximately equal to the nominal level $\alpha$ when the $X$'s and $Y$'s are samples
from any distributions with densities $f(x)$ and $f(y - \Delta)$ with finite variance. If
such an approximate level is not satisfactory, one may prefer to try to obtain
an exact level-$\alpha$ unbiased test (valid for all $f$) by replacing the original normal
model with the nonparametric model for which the joint density of the variables
is

$$f(x_1) \cdots f(x_m)\, f(y_1 - \Delta) \cdots f(y_n - \Delta), \qquad f \in \mathcal{F}, \tag{5.45}$$

where we shall take $\mathcal{F}$ to be the family of all probability densities that are
continuous a.e.

If there is much variation in the population being sampled, the sensitivity of
the experiment can frequently be increased by dividing the population into more
homogeneous subgroups, defined for example by some characteristic such as age
or sex. A sample of size $N_i$ $(i = 1,\dots,c)$ is then taken from the $i$th subpopulation:
$m_i$ to serve as controls, and the other $n_i = N_i - m_i$ to receive the treatment. If
the observations in the $i$th subgroup of such a stratified sample are denoted by

$$(X_{i1},\dots,X_{im_i};\ Y_{i1},\dots,Y_{in_i}) = (Z_{i1},\dots,Z_{iN_i}),$$

the density of $Z = (Z_{11},\dots,Z_{cN_c})$ is

$$p_\Delta(z) = \prod_{i=1}^{c} \left[f_i(x_{i1}) \cdots f_i(x_{im_i})\, f_i(y_{i1} - \Delta) \cdots f_i(y_{in_i} - \Delta)\right]. \tag{5.46}$$

Unbiasedness of a test $\phi$ for testing $\Delta = 0$ against $\Delta > 0$ implies that for all
$f_1,\dots,f_c$,

$$\int \phi(z)\, p_0(z)\,dz = \alpha \qquad (dz = dz_{11} \cdots dz_{cN_c}). \tag{5.47}$$


Theorem 5.8.1 If $\mathcal{F}$ is the family of all probability densities $f$ that are
continuous a.e., then (5.47) holds for all $f_1,\dots,f_c \in \mathcal{F}$ if and only if

$$\frac{1}{N_1! \cdots N_c!} \sum_{z' \in S(z)} \phi(z') = \alpha \quad \text{a.e.}, \tag{5.48}$$

where $S(z)$ is the set of points obtained from $z$ by permuting for each $i = 1,\dots,c$
the coordinates $z_{ij}$ $(j = 1,\dots,N_i)$ within the $i$th subgroup in all $N_1! \cdots N_c!$ possible
ways.

Proof. To prove the result for the case $c = 1$, note that the set of order statistics
$T(Z) = (Z_{(1)},\dots,Z_{(N)})$ is a complete sufficient statistic for $\mathcal{F}$ (Example 4.3.4).
A necessary and sufficient condition for (5.47) is therefore

$$E[\phi(Z) \mid T(z)] = \alpha \quad \text{a.e.} \tag{5.49}$$

The set $S(z)$ in the present case ($c = 1$) consists of the $N!$ points obtained from
$z$ through permutation of coordinates, so that $S(z) = \{z' : T(z') = T(z)\}$. It
follows from Section 2.4 that the conditional distribution of $Z$ given $T(z)$ assigns
probability $1/N!$ to each of the $N!$ points of $S(z)$. Thus (5.49) is equivalent to

$$\frac{1}{N!} \sum_{z' \in S(z)} \phi(z') = \alpha \quad \text{a.e.}, \tag{5.50}$$

as was to be proved. The proof for general $c$ is completely analogous and is left
as an exercise (Problem 5.44). ■

The tests satisfying (5.48) are called permutation tests. An extension of this 
definition is given in Problem 5.54. 


5.9 Most Powerful Permutation Tests 


For the problem of testing the hypothesis H : A = 0 of no treatment effect on 
the basis of a stratified sample with density (5.46) it was shown in the preceding 
section that unbiasedness implies (5.48). We shall now determine the test which, 
subject to (5.48), maximizes the power against a fixed alternative (5.46) or more 
generally against an alternative with arbitrary fixed density h(z). 

The power of a test $\phi$ against an alternative $h$ is

$$\int \phi(z) h(z)\,dz = \int E[\phi(Z) \mid t]\,dP^T(t).$$

Let $t = T(z) = (z_{(1)},\dots,z_{(N)})$, so that $S(z) = S(t)$. As was seen in Example
2.4.1 and Problem 2.6, the conditional expectation of $\phi(Z)$ given $T(Z) = t$ is

$$\psi(t) = \frac{\sum_{z \in S(t)} \phi(z) h(z)}{\sum_{z \in S(t)} h(z)}.$$

To maximize the power of $\phi$ subject to (5.48) it is therefore necessary to maximize
$\psi(t)$ for each $t$ subject to this condition. The problem thus reduces to the
determination of a function $\phi$ which subject to

$$\sum_{z \in S(t)} \phi(z)\,\frac{1}{N_1! \cdots N_c!} = \alpha,$$

maximizes

$$\sum_{z \in S(t)} \phi(z)\,\frac{h(z)}{\sum_{z' \in S(t)} h(z')}.$$

By the Neyman-Pearson fundamental lemma, this is achieved by rejecting $H$ for
those points $z$ of $S(t)$ for which the ratio

$$\frac{h(z)\,N_1! \cdots N_c!}{\sum_{z' \in S(t)} h(z')}$$

is too large. Thus the most powerful test is given by the critical function

$$\phi(z) = \begin{cases} 1 & \text{when } h(z) > C[T(z)], \\ \gamma & \text{when } h(z) = C[T(z)], \\ 0 & \text{when } h(z) < C[T(z)]. \end{cases} \tag{5.51}$$

To carry out the test, the $N_1! \cdots N_c!$ points of each set $S(z)$ are ordered according
to the values of the density $h$. The hypothesis is rejected for the $k$ largest values
and with probability $\gamma$ for the $(k+1)$st value, where $k$ and $\gamma$ are defined by

$$k + \gamma = \alpha N_1! \cdots N_c!.$$

Consider now in particular the alternatives (5.46). The most powerful permutation
test is seen to depend on $\Delta$ and the $f_i$, and is therefore not
UMP.

Of special interest is the class of normal alternatives with common variance:

$$f_i = N(\xi_i, \sigma^2).$$

The most powerful test against these alternatives, which turns out to be independent
of the $\xi_i$, $\sigma^2$, and $\Delta$, is appropriate when approximate normality is suspected
but the assumption is not felt to be reliable. It may then be desirable to control
the size of the test at level $\alpha$ regardless of the form of the densities $f_i$ and to
have the test unbiased against all alternatives (5.46). However, among the class
of tests satisfying these broad restrictions it is natural to make the selection so as
to maximize the power against the type of alternative one expects to encounter,
that is, against the normal alternatives.

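With the above choice of $f_i$, (5.46) becomes

\[ h(z) = (\sqrt{2\pi}\sigma)^{-N} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{c} \left( \sum_{j=1}^{m_i} (z_{ij} - \xi_i)^2 + \sum_{j=m_i+1}^{N_i} (z_{ij} - \xi_i - \Delta)^2 \right) \right]. \tag{5.52} \]

Since the factor $\exp[-\sum_i \sum_j (z_{ij} - \xi_i)^2 / 2\sigma^2]$ is constant over $S(t)$, the test (5.51) therefore rejects $H$ when $\exp(\Delta \sum_i \sum_{j=m_i+1}^{N_i} z_{ij}) > C[T(z)]$ and hence when

\[ \sum_{i=1}^{c} \sum_{j=1}^{n_i} y_{ij} = \sum_{i=1}^{c} \sum_{j=m_i+1}^{N_i} z_{ij} > C[T(z)]. \tag{5.53} \]

Of the $N_1! \cdots N_c!$ values that the test statistic takes on over $S(t)$, only

\[ \binom{N_1}{n_1} \cdots \binom{N_c}{n_c} \]

are distinct, since the value of the statistic is the same for any two points $z'$ and $z''$ for which $(z'_{i1}, \ldots, z'_{im_i})$ and $(z''_{i1}, \ldots, z''_{im_i})$ are permutations of each other for each $i$. It is therefore enough to compare these distinct values, and to reject $H$ for the $k'$ largest ones and with probability $\gamma'$ for the $(k'+1)$st, where

\[ k' + \gamma' = \alpha \binom{N_1}{n_1} \cdots \binom{N_c}{n_c}. \]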
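For small samples the test (5.53) can be carried out by brute-force enumeration. The following Python sketch covers only the case $c = 1$ and is an illustration under stated assumptions rather than a prescribed algorithm: the function name `permutation_test` is hypothetical, $\alpha$ is passed as an exact `Fraction` so that $k'$ and $\gamma'$ are computed without rounding, and tied sums (not discussed above) share the randomization probability equally.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, floor


def permutation_test(x, y, alpha):
    """Rejection probability of the one-sided test (5.53) for c = 1.

    x, y  : control and treatment observations (exact numbers such as ints)
    alpha : level as a Fraction, so that k' and gamma' are exact
    """
    z = list(x) + list(y)
    n = len(y)
    total = comb(len(z), n)               # number of treatment subsets
    # Sum of the "treated" observations for every choice of n positions.
    sums = sorted((sum(s) for s in combinations(z, n)), reverse=True)
    target = alpha * total                # k' + gamma'
    k = floor(target)
    if k >= total:                        # alpha = 1: reject everywhere
        return Fraction(1)
    cutoff = sums[k]                      # (k' + 1)st largest sum
    greater = sum(1 for s in sums if s > cutoff)
    equal = sum(1 for s in sums if s == cutoff)
    observed = sum(y)
    if observed > cutoff:
        return Fraction(1)
    if observed == cutoff:
        # Without ties this equals gamma' = target - k'.
        return (target - greater) / equal
    return Fraction(0)


# Hypothetical data: 3 controls, 3 treated, level 1/10.
print(permutation_test([1, 4, 6], [5, 8, 9], Fraction(1, 10)))
```

Averaging the returned value over all $\binom{N}{n}$ splits of a fixed pooled sample gives exactly $\alpha$, which is the permutation-test condition (5.48) for this statistic.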
The test (5.53) is most powerful against the normal alternatives under consideration among all tests which are unbiased and of level $\alpha$ for testing $H : \Delta = 0$ in the original family (5.46) with $f_1, \ldots, f_c \in \mathcal{F}$.$^{8}$ To complete the proof of this statement it is still necessary to prove the test unbiased against the alternatives (5.46). We shall show more generally that it is unbiased against all alternatives for which $X_{ij}$ $(j = 1, \ldots, m_i)$, $Y_{ik}$ $(k = 1, \ldots, n_i)$ are independently distributed with cumulative distribution functions $F_i$, $G_i$ respectively such that $Y_{ik}$ is stochastically larger than $X_{ij}$, that is, such that $G_i(z) \le F_i(z)$ for all $z$. This is a consequence of the following lemma.

Lemma 5.9.1 Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be samples from continuous distributions $F$, $G$, and let $\phi(x_1, \ldots, x_m; y_1, \ldots, y_n)$ be a critical function such that (a) its expectation is $\alpha$ whenever $G = F$, and (b) $y_i \le y'_i$ for $i = 1, \ldots, n$ implies

\[ \phi(x_1, \ldots, x_m; y_1, \ldots, y_n) \le \phi(x_1, \ldots, x_m; y'_1, \ldots, y'_n). \]

Then the expectation $\beta = \beta(F, G)$ of $\phi$ is $\ge \alpha$ for all pairs of distributions for which $Y$ is stochastically larger than $X$; it is $\le \alpha$ if $X$ is stochastically larger than $Y$.

Proof. By Lemma 3.4.1, there exist functions $f$, $g$ and independent random variables $V_1, \ldots, V_{m+n}$ such that the distributions of $f(V_i)$ and $g(V_i)$ are $F$ and $G$ respectively and that $f(z) \le g(z)$ for all $z$. Then

\[ E\phi[f(V_1), \ldots, f(V_m); f(V_{m+1}), \ldots, f(V_{m+n})] = \alpha \]

and

\[ E\phi[f(V_1), \ldots, f(V_m); g(V_{m+1}), \ldots, g(V_{m+n})] = \beta. \]

Since for all $(v_1, \ldots, v_{m+n})$,

\[ \phi[f(v_1), \ldots, f(v_m); f(v_{m+1}), \ldots, f(v_{m+n})] \le \phi[f(v_1), \ldots, f(v_m); g(v_{m+1}), \ldots, g(v_{m+n})], \]

the same inequality holds for the expectations of both sides, and hence $\alpha \le \beta$.

The proof for the case that $X$ is stochastically larger than $Y$ is completely analogous.

The lemma also generalizes to the case of $c$ vectors $(X_{i1}, \ldots, X_{im_i}; Y_{i1}, \ldots, Y_{in_i})$ with distributions $(F_i, G_i)$. If the expectation of a function $\phi$ is $\alpha$ when $F_i = G_i$ and $\phi$ is nondecreasing in each $y_{ij}$ when all other variables are held fixed, then it follows as before that the expectation of $\phi$ is $\ge \alpha$ when the random variables with distribution $G_i$ are stochastically larger than those with distribution $F_i$.

In applying the lemma to the permutation test (5.53) it is enough to consider the case $c = 1$, the argument in the more general case being completely analogous. Since the rejection probability of the test (5.53) is $\alpha$ whenever $F = G$, it is only necessary to show that the critical function $\phi$ of the test satisfies (b). Now $\phi = 1$ if $\sum_{i=m+1}^{m+n} z_i$ exceeds sufficiently many of the sums $\sum_{i=m+1}^{m+n} z_{j_i}$, and hence if sufficiently many of the differences

\[ \sum_{i=m+1}^{m+n} z_i - \sum_{i=m+1}^{m+n} z_{j_i} \]

are positive. For a particular permutation $(j_1, \ldots, j_{m+n})$

\[ \sum_{i=m+1}^{m+n} z_i - \sum_{i=m+1}^{m+n} z_{j_i} = \sum_{i=1}^{p} z_{s_i} - \sum_{i=1}^{p} z_{r_i}, \]

where $r_1 < \cdots < r_p$ denote those of the integers $j_{m+1}, \ldots, j_{m+n}$ that are $\le m$, and $s_1 < \cdots < s_p$ those of the integers $m+1, \ldots, m+n$ not included in the set $(j_{m+1}, \ldots, j_{m+n})$. If $\sum z_{s_i} - \sum z_{r_i}$ is positive and $y_i \le y'_i$, that is, $z_i \le z'_i$ for $i = m+1, \ldots, m+n$, then the difference $\sum z'_{s_i} - \sum z_{r_i}$ is also positive and hence $\phi$ satisfies (b).

$^{8}$ For a closely related result, see Oden and Wedel (1975).

The same argument also shows that the rejection probability of the test is $\le \alpha$ when the density of the variables is given by (5.46) with $\Delta \le 0$. The test is therefore equally appropriate if the hypothesis $\Delta = 0$ is replaced by $\Delta \le 0$.

Except for small values of the sample sizes $N_i$, the amount of computation required to carry out the permutation test (5.53) is large. Computational methods are discussed by Green (1977), John and Robinson (1983b), Diaconis and Holmes (1994) and Chapter 13 of Good (1994), who has an extensive bibliography.

One can relate the permutation test to the corresponding normal theory t-test as follows. On multiplying both sides of the inequality

\[ \sum y_j > C[T(z)] \]

by $(1/m) + (1/n)$ and subtracting $(\sum x_i + \sum y_j)/m$, the rejection region for $c = 1$ becomes $\bar{y} - \bar{x} > C[T(z)]$ or $W = (\bar{y} - \bar{x}) / \sqrt{\sum_{i=1}^{N} (z_i - \bar{z})^2} > C[T(z)]$, since the denominator of $W$ is constant over $S(z)$ and hence depends only on $T(z)$. As was seen at the end of Section 5.3, this is equivalent to

\[ \frac{(\bar{y} - \bar{x}) \big/ \sqrt{\frac{1}{m} + \frac{1}{n}}}{\sqrt{\left[ \sum (x_i - \bar{x})^2 + \sum (y_j - \bar{y})^2 \right] / (m + n - 2)}} > C[T(z)]. \tag{5.54} \]

The rejection region therefore has the form of a t-test in which the constant cutoff point $C_0$ of (5.27) has been replaced by a random one. It turns out that when the hypothesis is true, so that the $Z$'s are identically and independently distributed, and $m/n$ is bounded away from zero and infinity as $m$ and $n$ tend to infinity, the difference between the random cutoff point $C[T(Z)]$ and $C_0$ is small in an appropriate asymptotic sense, and so the permutation test and the t-test given by (5.27)-(5.29) behave similarly in large samples. Such results will be developed in Section 15.2. In this sense, the permutation test can be approximated for large samples by the standard t-test. Exactly analogous results hold for $c > 1$; the appropriate generalization of the two-sample t-test is provided in Problem 7.9. ■
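The equivalence between ordering the points of $S(z)$ by $\sum y_j$ and ordering them by the t-statistic can be checked numerically. The sketch below is only an illustration: the helper names `t_statistic` and `split_statistics` and the data are assumptions made here. It enumerates every split of a small pooled sample and records both statistics.

```python
from itertools import combinations
from math import sqrt


def t_statistic(x, y):
    """Two-sample t-statistic in the form of the left side of (5.54)."""
    m, n = len(x), len(y)
    xbar, ybar = sum(x) / m, sum(y) / n
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    return (ybar - xbar) / sqrt(1 / m + 1 / n) / sqrt(ss / (m + n - 2))


def split_statistics(z, n):
    """(sum of y's, t-statistic) for every choice of n treated observations."""
    out = []
    for idx in combinations(range(len(z)), n):
        y = [z[i] for i in idx]
        x = [z[i] for i in range(len(z)) if i not in idx]
        out.append((sum(y), t_statistic(x, y)))
    return out


# Hypothetical pooled sample of size 6 with n = 3 treated units.
stats = sorted(split_statistics([1.0, 4.0, 6.0, 5.0, 8.0, 9.5], 3))
for total, t in stats:
    print(f"sum of y = {total:5.1f}   t = {t:8.3f}")
```

Sorted by the sum of the $y$'s, the t-values come out in increasing order, so the $k$ largest sums and the $k$ largest t-values pick out the same points of $S(z)$.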
5.10 Randomization As A Basis For Inference


The problem of testing for the effect of a treatment was considered in Section 5.3 under the assumption that the treatment and control measurements $X_1, \ldots, X_m$, and $Y_1, \ldots, Y_n$ constitute samples from normal distributions, and in Sections 5.8 and 5.9 without relying on the assumption of normality. We shall now consider in somewhat more detail the structure of the experiment from which the data are obtained, resuming for the moment the assumption that the distributions involved are normal.

Suppose that the experimental material consists of $m + n$ patients, plants, pieces of material, or the like, drawn at random from the population to which the treatment could be applied. The treatment is given to $n$ of these while the other $m$ serve as controls. The characteristic that is to be influenced by the treatment is then measured in each case, leading to observations $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$.

To be specific, suppose that the treatment is carried out by injecting a drug and that $m + n$ ampules are assigned to the $m + n$ patients. The $i$th measurement can be considered as the sum of two components. One, say $U_i$, is associated with the $i$th patient; the other, $V_i$, with the $i$th ampule and the circumstances under which it is administered and under which the measurements are taken. The variables $U_i$ and $V_i$ are assumed to be independently distributed, the $V$'s with normal distribution $N(\eta, \sigma^2)$ or $N(\xi, \sigma^2)$ as the ampule contains the drug or is one of those used for control. If in addition the $U$'s are assumed to constitute a random sample from $N(u, \sigma_1^2)$, it follows that the $X$'s and $Y$'s are independently normally distributed with common variance $\sigma^2 + \sigma_1^2$ and means

\[ E(X) = u + \xi, \qquad E(Y) = u + \eta. \]

Except for a change of notation their joint distribution is then given by (5.26), and the hypothesis $\eta = \xi$ can be tested by the standard t-test.

Unfortunately, under actual experimental conditions, it is frequently not possible to ensure that the patients or other experimental units constitute a random sample from the population of such units. They may be patients in a certain hospital at a given time, or volunteers for an experiment, and may constitute a haphazard rather than a random sample. In this case the $U$'s would have to be considered as unknown constants, since they are not obtained by any definite sampling procedure. This assumption is appropriate also in a different context. Suppose that the experimental units are all the machines in a shop or fields on a farm. If the experiment is performed only to determine the best method for this particular shop or farm, these experimental units are the only relevant ones; that is, a replication of the experiment would consist in comparing the two treatments again for the same machines or fields rather than for a new batch drawn at random from a large population. In this case the units themselves, and therefore the $u$'s, are constant. Under the above assumptions the joint density of the $m + n$ measurements is

\[ \frac{1}{(\sqrt{2\pi}\sigma)^{m+n}} \exp\left[ -\frac{1}{2\sigma^2} \left( \sum_{i=1}^{m} (x_i - u_i - \xi)^2 + \sum_{j=1}^{n} (y_j - u_{m+j} - \eta)^2 \right) \right]. \]

Since the $u$'s are completely arbitrary, it is clearly impossible to distinguish between $H : \eta = \xi$ and the alternatives $K : \eta > \xi$. In fact, every distribution of $K$ also belongs to $H$ and vice versa, and the most powerful level-$\alpha$ test for testing $H$ against any simple alternative specifying $\xi$, $\eta$, $\sigma$, and the $u$'s rejects $H$ with probability $\alpha$ regardless of the observations.

Data which could serve as a basis for testing whether or not the treatment has an effect can be obtained through the fundamental device of randomization. Suppose that the $N = m + n$ patients are assigned to the $N$ ampules at random, that is, in such a way that each of the $N!$ possible assignments has probability $1/N!$ of being chosen. Then for a given assignment the $N$ measurements are independently normally distributed with variance $\sigma^2$ and means $\xi + u_{j_i}$ $(i = 1, \ldots, m)$ and $\eta + u_{j_i}$ $(i = m + 1, \ldots, m + n)$. The overall joint density of the variables

\[ (Z_1, \ldots, Z_N) = (X_1, \ldots, X_m; Y_1, \ldots, Y_n) \]

is therefore

\[ \frac{1}{N!} \sum_{(j_1, \ldots, j_N)} \frac{1}{(\sqrt{2\pi}\sigma)^N} \exp\left[ -\frac{1}{2\sigma^2} \left( \sum_{i=1}^{m} (x_i - u_{j_i} - \xi)^2 + \sum_{i=1}^{n} (y_i - u_{j_{m+i}} - \eta)^2 \right) \right], \tag{5.55} \]

where the outer summation extends over all $N!$ permutations $(j_1, \ldots, j_N)$ of $(1, \ldots, N)$. Under the hypothesis $\eta = \xi$ this density can be written as

\[ \frac{1}{N!} \sum_{(j_1, \ldots, j_N)} \frac{1}{(\sqrt{2\pi}\sigma)^N} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (z_i - \zeta_{j_i})^2 \right], \tag{5.56} \]

where $\zeta_{j_i} = u_{j_i} + \xi = u_{j_i} + \eta$.

Without randomization a set of $y$'s which is large relative to the $x$-values could be explained entirely in terms of the unit effects $u_i$. However, if these are assigned to the $y$'s at random, they will on the average balance those assigned to the $x$'s. As a consequence, a marked superiority of the second sample becomes very unlikely under the hypothesis, and must therefore be attributed to the effectiveness of the treatment.

The method of assigning the treatments to the experimental units completely at random permits the construction of a level-$\alpha$ test of the hypothesis $\eta = \xi$, whose power exceeds $\alpha$ against all alternatives $\eta - \xi > 0$. The actual power of such a test will however depend not only on the alternative value of $\eta - \xi$, which measures the effect of the treatment, but also on the unit effects $u_i$. In particular, if there is excessive variation among the $u$'s this will swamp the treatment effect (much in the same way as an increase in the variance $\sigma^2$ would), and the test will accordingly have little power to detect any given alternative $\eta - \xi$.

In such cases the sensitivity of the experiment can be increased by an approach exactly analogous to the method of stratified sampling discussed in Section 5.8. In the present case this means replacing the process of complete randomization described above by a more restricted randomization procedure. The experimental material is divided into subgroups, which are more homogeneous than the material as a whole, so that within each group the differences among the $u$'s are small. In animal experiments, for example, this can frequently be achieved by a division into litters. Randomization is then applied only within each group. If the $i$th group contains $N_i$ units, $n_i$ of these are selected at random to receive the treatment, and the remaining $m_i = N_i - n_i$ serve as controls ($\sum N_i = N$, $\sum m_i = m$, $\sum n_i = n$).
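The restricted randomization just described can be sketched in a few lines of Python. This is a minimal illustration, not a procedure given in the text: the function names, the representation of units by indices within each group, and the fixed seed (used only for reproducibility) are assumptions made here.

```python
import random
from math import comb


def randomize_within_groups(group_sizes, treated_per_group, rng):
    """For each group i, pick n_i of its N_i units (by index) for treatment."""
    return [sorted(rng.sample(range(N_i), n_i))
            for N_i, n_i in zip(group_sizes, treated_per_group)]


def count_assignments(group_sizes, treated_per_group):
    """Number of equally likely treatment assignments: prod_i C(N_i, n_i)."""
    count = 1
    for N_i, n_i in zip(group_sizes, treated_per_group):
        count *= comb(N_i, n_i)
    return count


# Hypothetical design: three litters of sizes 4, 6, 2 with 2, 3, 1 treated.
rng = random.Random(0)
print(randomize_within_groups([4, 6, 2], [2, 3, 1], rng))
print(count_assignments([4, 6, 2], [2, 3, 1]))
```

Because the selection is made independently within each group, every one of the $\prod_i \binom{N_i}{n_i}$ assignments is equally likely.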

An example of this approach is the method of matched pairs. Here the experimental units are divided into pairs, which are as like each other as possible with respect to all relevant properties, so that within each pair the difference of the $u$'s will be as small as possible. Suppose that the material consists of $n$ such pairs, and denote the associated unit effects (the $U$'s of the previous discussion) by $U_1, U'_1; \ldots; U_n, U'_n$. Let the first and second member of each pair receive the treatment or serve as control respectively, and let the observations for the $i$th pair be $X_i$ and $Y_i$. If the matching is completely successful, as may be the case, for example, when the same patient is used twice in the investigation of a sleeping drug, or when identical twins are used, then $U'_i = U_i$ for all $i$, and the density of the $X$'s and $Y$'s is

\[ \frac{1}{(\sqrt{2\pi}\sigma)^{2n}} \exp\left[ -\frac{1}{2\sigma^2} \left( \sum (x_i - \xi - u_i)^2 + \sum (y_i - \eta - u_i)^2 \right) \right]. \tag{5.57} \]

The UMP unbiased test for testing $H : \eta = \xi$ against $\eta > \xi$ is then given in terms of the differences $W_i = Y_i - X_i$ by the rejection region

\[ \frac{\sqrt{n}\,\bar{w}}{\sqrt{\sum (w_i - \bar{w})^2 / (n - 1)}} > C. \tag{5.58} \]

(See Problem 5.48.)

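However, usually one is not willing to trust the assumption $U'_i = U_i$ even after matching, and it again becomes necessary to randomize. Since as a result of the matching the variability of the $u$'s within each pair is presumably considerably smaller than the overall variation, randomization is carried out only within each pair. For each pair, one of the units is selected with probability $\frac{1}{2}$ to receive the treatment, while the other serves as control. The density of the $X$'s and $Y$'s is then

\[ \frac{1}{2^n} \frac{1}{(\sqrt{2\pi}\sigma)^{2n}} \prod_{i=1}^{n} \left\{ \exp\left[ -\frac{1}{2\sigma^2} \left( (x_i - \xi - u_i)^2 + (y_i - \eta - u'_i)^2 \right) \right] + \exp\left[ -\frac{1}{2\sigma^2} \left( (x_i - \xi - u'_i)^2 + (y_i - \eta - u_i)^2 \right) \right] \right\}. \tag{5.59} \]

Under the hypothesis $\eta = \xi$, and writing

\[ z_{i1} = x_i, \quad z_{i2} = y_i, \quad \zeta_{i1} = \xi + u_i, \quad \zeta_{i2} = \eta + u'_i \qquad (i = 1, \ldots, n), \]

this becomes

\[ \frac{1}{2^n} \sum \frac{1}{(\sqrt{2\pi}\sigma)^{2n}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{2} (z_{ij} - \zeta'_{ij})^2 \right]. \tag{5.60} \]

Here the outer summation extends over the $2^n$ points $\zeta' = (\zeta'_{11}, \ldots, \zeta'_{n2})$ for which $(\zeta'_{i1}, \zeta'_{i2})$ is either $(\zeta_{i1}, \zeta_{i2})$ or $(\zeta_{i2}, \zeta_{i1})$.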
5.11 Permutation Tests and Randomization


It was shown in the preceding section that randomization provides a basis for testing the hypothesis $\eta = \xi$ of no treatment effect, without any assumptions concerning the experimental units. In the present section, a specific test will be derived for this problem. When the experimental units are treated as constants, the probability density of the observations is given by (5.55) in the case of complete randomization and by (5.59) in the case of matched pairs. More generally, let the experimental material be divided into $c$ subgroups, let the randomization be applied within each subgroup, and let the observations in the $i$th subgroup be

\[ (Z_{i1}, \ldots, Z_{iN_i}) = (X_{i1}, \ldots, X_{im_i}; Y_{i1}, \ldots, Y_{in_i}). \]

For any point $u = (u_{11}, \ldots, u_{cN_c})$, let $S(u)$ denote as before the set of $N_1! \cdots N_c!$ points obtained from it by permuting the coordinates within each subgroup in all $N_1! \cdots N_c!$ possible ways. Then the joint density of the $Z$'s given $u$ is

\[ \frac{1}{N_1! \cdots N_c!} \sum_{u' \in S(u)} \frac{1}{(\sqrt{2\pi}\sigma)^N} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{c} \left( \sum_{j=1}^{m_i} (z_{ij} - \xi - u'_{ij})^2 + \sum_{j=m_i+1}^{N_i} (z_{ij} - \eta - u'_{ij})^2 \right) \right], \tag{5.61} \]

and under the hypothesis of no treatment effect

\[ p_{\sigma,\zeta}(z) = \frac{1}{N_1! \cdots N_c!} \sum_{\zeta' \in S(\zeta)} \frac{1}{(\sqrt{2\pi}\sigma)^N} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{c} \sum_{j=1}^{N_i} (z_{ij} - \zeta'_{ij})^2 \right]. \tag{5.62} \]


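It may happen that the coordinates of $u$ or $\zeta$ are not distinct. If then some of the points of $S(u)$ or $S(\zeta)$ also coincide, each should be counted with its proper multiplicity. More precisely, if the $N_1! \cdots N_c!$ relevant permutations of $N_1 + \cdots + N_c$ coordinates are denoted by $g_k$, $k = 1, \ldots, N_1! \cdots N_c!$, then $S(\zeta)$ can be taken to be the ordered set of points $g_k\zeta$, $k = 1, \ldots, N_1! \cdots N_c!$, and (5.62), for example, becomes

\[ p_{\sigma,\zeta}(z) = \frac{1}{N_1! \cdots N_c!} \sum_{k=1}^{N_1! \cdots N_c!} \frac{1}{(\sqrt{2\pi}\sigma)^N} \exp\left( -\frac{1}{2\sigma^2} |z - g_k\zeta|^2 \right), \]

where $|u|^2$ stands for $\sum_{i=1}^{c} \sum_{j=1}^{N_i} u_{ij}^2$.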


Theorem 5.11.1 A necessary and sufficient condition for a critical function $\phi$ to satisfy

\[ \int \phi(z)\,p_{\sigma,\zeta}(z)\,dz \le \alpha \qquad (dz = dz_{11} \cdots dz_{cN_c}) \tag{5.63} \]

for all $\sigma > 0$ and all vectors $\zeta$ is that

\[ \frac{1}{N_1! \cdots N_c!} \sum_{z' \in S(z)} \phi(z') \le \alpha \quad \text{a.e.} \tag{5.64} \]

The proof will be based on the following lemma.

Lemma 5.11.1 Let $A$ be a set in $N$-space with positive Lebesgue measure $\mu(A)$. Then for any $\epsilon > 0$ there exist real numbers $\sigma > 0$ and $\xi_1, \ldots, \xi_N$, such that

\[ P\{(X_1, \ldots, X_N) \in A\} \ge 1 - \epsilon, \]

where the $X$'s are independently normally distributed with means $E(X_i) = \xi_i$ and variance $\sigma^2_{X_i} = \sigma^2$.

Proof. Suppose without loss of generality that $\mu(A) < \infty$. Given any $\eta > 0$, there exists a square $Q$ such that

\[ \mu(Q \cap A^c) \le \eta\,\mu(Q). \]

This follows from the fact that almost every point of $A$ is a density point,$^{9}$ or from the more elementary fact that a measurable set can be approximated in measure by unions of disjoint squares. Let $a$ be such that

\[ \frac{1}{\sqrt{2\pi}} \int_{-a}^{a} \exp\left( -\frac{t^2}{2} \right) dt = \left( 1 - \frac{\epsilon}{2} \right)^{1/N} \]

and let

\[ \eta = \frac{\epsilon}{2} \left( \frac{\sqrt{2\pi}}{2a} \right)^N. \]

If $(\xi_1, \ldots, \xi_N)$ is the center of $Q$, and if $\sigma = b/a = (1/2a)[\mu(Q)]^{1/N}$, where $2b$ is the length of the side of $Q$, then

\[ \begin{aligned} \frac{1}{(\sqrt{2\pi}\sigma)^N} \int_{A^c \cap Q^c} \exp\left[ -\frac{1}{2\sigma^2} \sum (x_i - \xi_i)^2 \right] dx_1 \cdots dx_N &\le \frac{1}{(\sqrt{2\pi}\sigma)^N} \int_{Q^c} \exp\left[ -\frac{1}{2\sigma^2} \sum (x_i - \xi_i)^2 \right] dx_1 \cdots dx_N \\ &= 1 - \left[ \frac{1}{\sqrt{2\pi}} \int_{-a}^{a} \exp\left( -\frac{t^2}{2} \right) dt \right]^N = \frac{\epsilon}{2}. \end{aligned} \]

On the other hand,

\[ \frac{1}{(\sqrt{2\pi}\sigma)^N} \int_{A^c \cap Q} \exp\left[ -\frac{1}{2\sigma^2} \sum (x_i - \xi_i)^2 \right] dx_1 \cdots dx_N \le \frac{1}{(\sqrt{2\pi}\sigma)^N}\,\mu(A^c \cap Q) \le \frac{\epsilon}{2}, \]

and by adding the two inequalities one obtains the desired result. ■
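The choice of constants in this proof can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the formula used for $\eta$ is the one that makes the bound on the mass of $A^c \cap Q$ come out to $\epsilon/2$ when $\sigma = b/a$. It uses the standard library's `statistics.NormalDist` for the normal distribution function and its inverse.

```python
from math import pi, sqrt
from statistics import NormalDist


def lemma_constants(eps, N, b):
    """Constants a, eta, sigma for a cube Q of side 2b in N-space."""
    q = (1 - eps / 2) ** (1 / N)             # mass needed per coordinate
    a = NormalDist().inv_cdf((1 + q) / 2)    # P(|T| <= a) = q, T ~ N(0, 1)
    eta = (eps / 2) * (sqrt(2 * pi) / (2 * a)) ** N
    sigma = b / a
    return a, eta, sigma


def cube_mass(a, N):
    """Probability that N independent N(xi_i, sigma^2) variables fall in Q."""
    std = NormalDist()
    return (std.cdf(a) - std.cdf(-a)) ** N


# Hypothetical values: eps = 0.1 in 3-space, cube of side 2b = 4.
a, eta, sigma = lemma_constants(0.1, 3, 2.0)
print(cube_mass(a, 3))                                   # mass of Q
print(eta * (2 * 2.0) ** 3 / (sqrt(2 * pi) * sigma) ** 3)  # bound on A^c in Q
```

The first printed value is the normal probability of $Q$, which should equal $1 - \epsilon/2$; the second is the upper bound $\eta\,\mu(Q)/(\sqrt{2\pi}\sigma)^N$, which should equal $\epsilon/2$.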
Proof of the theorem. Let $\phi$ be any critical function, and let

\[ \psi(z) = \frac{1}{N_1! \cdots N_c!} \sum_{z' \in S(z)} \phi(z'). \]

If (5.64) does not hold, there exists $\eta > 0$ such that $\psi(z) > \alpha + \eta$ on a set $A$ of positive measure. By the Lemma there exists $\sigma > 0$ and $\zeta = (\zeta_{11}, \ldots, \zeta_{cN_c})$ such that $P\{Z \in A\} > 1 - \eta$ when $Z_{11}, \ldots, Z_{cN_c}$ are independently normally distributed with common variance $\sigma^2$ and means $E(Z_{ij}) = \zeta_{ij}$. It follows that

\[ \begin{aligned} \int \phi(z)\,p_{\sigma,\zeta}(z)\,dz &= \int \psi(z)\,p_{\sigma,\zeta}(z)\,dz \\ &\ge \int_A \psi(z)\,\frac{1}{(\sqrt{2\pi}\sigma)^N} \exp\left[ -\frac{1}{2\sigma^2} \sum\sum (z_{ij} - \zeta_{ij})^2 \right] dz \\ &> (\alpha + \eta)(1 - \eta), \end{aligned} \tag{5.65} \]

which is $> \alpha$, since $\alpha + \eta < 1$. This proves that (5.63) implies (5.64). The converse follows from the first equality in (5.65). ■

$^{9}$ See, for example, Billingsley (1995), p. 417.


Corollary 5.11.1 Let $H$ be the class of densities

\[ \{ p_{\sigma,\zeta}(z) : \sigma > 0,\ -\infty < \zeta_{ij} < \infty \}. \]

A complete family of tests for $H$ at level of significance $\alpha$ is the class of tests $\mathcal{C}$ satisfying

\[ \frac{1}{N_1! \cdots N_c!} \sum_{z' \in S(z)} \phi(z') = \alpha \quad \text{a.e.} \tag{5.66} \]

Proof. The corollary states that for any given level-$\alpha$ test $\phi_0$ there exists an element $\phi$ of $\mathcal{C}$ which is uniformly at least as powerful as $\phi_0$. By the preceding theorem the average value of $\phi_0$ over each set $S(z)$ is $\le \alpha$. On the sets for which this inequality is strict, one can increase $\phi_0$ to obtain a critical function $\phi$ satisfying (5.66), and such that $\phi_0(z) \le \phi(z)$ for all $z$. Since against all alternatives the power of $\phi$ is at least that of $\phi_0$, this establishes the result. An explicit construction of $\phi$, which shows that it can be chosen to be measurable, is given in Problem 5.51.

This corollary shows that the normal randomization model (5.61) leads exactly to the class of tests that was previously found to be relevant when the $U$'s constituted a sample but the assumption of normality was not imposed. It therefore follows from Section 5.9 that the most powerful level-$\alpha$ test for testing (5.62) against a simple alternative (5.61) is given by (5.51) with $h(z)$ equal to the probability density (5.61). If $\eta - \xi = \Delta$, the rejection region of this test reduces to

\[ \sum_{u' \in S(u)} \exp\left[ \frac{1}{\sigma^2} \sum_{i=1}^{c} \left( \sum_{j=1}^{N_i} z_{ij} u'_{ij} + \Delta \sum_{j=m_i+1}^{N_i} (z_{ij} - u'_{ij}) \right) \right] > C[T(z)], \tag{5.67} \]

since both $\sum\sum z_{ij}$ and $\sum\sum z_{ij}^2$ are constant on $S(z)$ and therefore functions only of $T(z)$. It is seen that this test depends on $\Delta$ and the unit effects $u_{ij}$, so that a UMP test does not exist.

Among the alternatives (5.61) a subclass occupies a central position and is of particular interest. This is the class of alternatives specified by the assumption that the unit effects $u_i$ constitute a sample from a normal distribution. Although this assumption cannot be expected to hold exactly (in fact, it was just as a safeguard against the possibility of its breakdown that randomization was introduced), it is in many cases reasonable to suppose that it holds at least approximately. The resulting subclass of alternatives is given by the probability densities

\[ \frac{1}{(\sqrt{2\pi}\sigma)^N} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{c} \left( \sum_{j=1}^{m_i} (z_{ij} - u_i - \xi)^2 + \sum_{j=m_i+1}^{N_i} (z_{ij} - u_i - \eta)^2 \right) \right]. \tag{5.68} \]


These alternatives are suggestive also from a slightly different point of view. The procedure of assigning the experimental units to the treatments at random within each subgroup was seen to be appropriate when the variation of the $u$'s is small within these groups and is employed when this is believed to be the case. This suggests, at least as an approximation, the assumption of constant $u_{ij} = u_i$, which is the limiting case of a normal distribution as the variance tends to zero, and for which the density is also given by (5.68).

Since the alternatives (5.68) are the same as the alternatives (5.52) of Section 5.9 with $u_i + \xi = \xi_i$, $u_i + \eta = \xi_i + \Delta$, the permutation test (5.53) is seen to be most powerful for testing the hypothesis $\eta = \xi$ in the normal randomization model (5.61) against the alternatives (5.68) with $\eta - \xi > 0$. The test retains this property in the still more general setting in which neither normality nor the sample property of the $U$'s is assumed to hold. Let the joint density of the variables be

\[ \sum_{u' \in S(u)} \prod_{i=1}^{c} \left[ \prod_{j=1}^{m_i} f_i(z_{ij} - u'_{ij} - \xi) \prod_{j=m_i+1}^{N_i} f_i(z_{ij} - u'_{ij} - \eta) \right], \tag{5.69} \]

with $f_i$ continuous a.e. but otherwise unspecified.$^{10}$ Under the hypothesis $H : \eta = \xi$, this density is symmetric in the variables $(z_{i1}, \ldots, z_{iN_i})$ of the $i$th subgroup for each $i$, so that any permutation test (5.48) has rejection probability $\alpha$ for all distributions of $H$. By Corollary 5.11.1, these permutation tests therefore constitute a complete class, and the result follows. ■

$^{10}$ Actually, all that is needed is that $f_1, \ldots, f_c \in \mathcal{F}$, where $\mathcal{F}$ is any family containing all normal distributions.


5.12 Randomization Model and Confidence Intervals

In the preceding section, the unit responses $u_i$ were unknown constants (parameters) which were observed with error, the latter represented by the random terms $V_i$. A limiting case assumes that the variation of the $V$'s is so small compared with that of the $u$'s that these error variables can be taken to be constant, i.e. that $V_i = v$. The constant $v$ can then be absorbed into the $u$'s, and can therefore be assumed to be zero. This leads to the following two-sample randomization model:

$N$ subjects would give “true” responses $u_1, \ldots, u_N$ if used as controls. The subjects are assigned at random, $n$ to treatment and $m$ to control. If the responses are denoted by $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ as before, then under the hypothesis $H$ of no treatment effect, the $X$'s and $Y$'s are a random permutation of the $u$'s. Under this model, in which the random assignment of the subjects to treatment and control constitutes the only random element, the probability of the rejection region (5.54) is the same as under the more elaborate models of the preceding sections.

The corresponding limiting model under the alternatives assumes that the treatment has the effect of adding a constant amount $\Delta$ to the unit response, so that the $X$'s and $Y$'s are given by $(u_{i_1}, \ldots, u_{i_m}; u_{i_{m+1}} + \Delta, \ldots, u_{i_{m+n}} + \Delta)$ for some permutation $(i_1, \ldots, i_N)$ of $(1, \ldots, N)$.

These models generalize in the obvious way to stratified samples. In particular, for paired comparisons it is assumed under $H$ that the unit effects $(u_i, u'_i)$ are constants, of which one is assigned at random to treatment and the other to control. Thus the pair $(X_i, Y_i)$ is equal to $(u_i, u'_i)$ or $(u'_i, u_i)$ with probability $\frac{1}{2}$ each, and the assignments in the $n$ pairs are independent; the sample space consists of $2^n$ points each of which has probability $(\frac{1}{2})^n$. Under the alternative, it is assumed as before that $\Delta$ is added to each treated subject, so that $P(X_i = u_i, Y_i = u'_i + \Delta) = P(X_i = u'_i, Y_i = u_i + \Delta) = \frac{1}{2}$. The distribution generated for the observations by such a randomization model is exactly the conditional distribution given $T(z)$ of the preceding sections. In the two-sample case, for example, this common distribution is specified by the fact that all permutations of $(X_1, \ldots, X_m; Y_1 - \Delta, \ldots, Y_n - \Delta)$ are equally likely. As a consequence, the power of the test (5.54) in the randomization model is also the conditional power in the two-sample model (5.45). As was pointed out in Section 4.4, the conditional power $\beta(\Delta \mid T(z))$ can be interpreted as an unbiased estimate of the unconditional power $\beta_F(\Delta)$ in the two-sample model. The advantage of $\beta(\Delta \mid T(z))$ is that it depends only on $\Delta$, not on the unknown $F$. Approximations to $\beta(\Delta \mid T(z))$ are discussed by J. Robinson (1973), G. Robinson (1982), John and Robinson (1983a), and Gabriel and Hsu (1983).

The tests (5.53), which apply to all three models (the sampling model (5.46), the randomization model, and the intermediate model (5.69)), can be inverted in the usual way to produce confidence sets for $\Delta$. We shall now determine these sets explicitly for the paired comparisons and the two-sample case. The derivations will be carried out in the randomization model. However, they apply equally in the other two models, since the tests, and therefore the associated confidence sets, are identical for the three models.

Consider first the case of paired observations $(x_i, y_i)$, $i = 1, \ldots, n$. The one-sided test rejects $H : \Delta = 0$ in favor of $\Delta > 0$ when $\sum_{i=1}^{n} y_i$ is among the $K$ largest of the $2^n$ sums obtained by replacing $y_i$ by $x_i$ for all, some, or none of the values $i = 1, \ldots, n$. (It is assumed here for the sake of simplicity that $\alpha = K/2^n$, so that the test requires no randomization to achieve the exact level $\alpha$.) Let $d_i = y_i - x_i = 2y_i - t_i$, where $t_i = x_i + y_i$ is fixed. Then the test is equivalent to rejecting when $\sum d_i$ is one of the $K$ largest of the $2^n$ values $\sum \pm d_i$, since an interchange of $y_i$ with $x_i$ is equivalent to replacing $d_i$ by $-d_i$. Consider now testing $H : \Delta = \Delta_0$ against $\Delta > \Delta_0$. The test then accepts when $\sum (d_i - \Delta_0)$ is one of the $l = 2^n - K$ smallest of the $2^n$ sums $\sum \pm (d_i - \Delta_0)$, since it is now $y_i - \Delta_0$ that is being interchanged with $x_i$. We shall next invert this statement, replacing $\Delta_0$ by $\Delta$, and see that it is equivalent to a lower confidence bound for $\Delta$.
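The paired test just described amounts to enumerating the $2^n$ sign patterns of the differences. A minimal Python sketch follows; the function name `paired_randomization_test` and the optional argument `delta0` are assumptions made for illustration, and ties among the sums are not treated (with ties the comparison below may reject for more than $K$ of the patterns).

```python
from itertools import product


def paired_randomization_test(x, y, K, delta0=0):
    """Return True if H: Delta = delta0 is rejected in favor of Delta > delta0.

    Rejects when sum(d_i - delta0) is at least as large as the K-th largest
    of the 2^n sums of +/-(d_i - delta0); without ties the level is K / 2^n.
    """
    d = [yi - xi - delta0 for xi, yi in zip(x, y)]
    sums = sorted((sum(s * di for s, di in zip(signs, d))
                   for signs in product((1, -1), repeat=len(d))),
                  reverse=True)
    return sum(d) >= sums[K - 1]


# Hypothetical data for n = 3 pairs; K = 1 corresponds to level 1/8.
x, y = [1, 2, 3], [4, 3, 8]
print(paired_randomization_test(x, y, K=1))
print(paired_randomization_test(x, y, K=1, delta0=4))
```

Passing a nonzero `delta0` carries out the test of $H : \Delta = \Delta_0$, which is the form that is inverted next.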





In the inequality

\[ \sum (d_i - \Delta) < \sum [\pm (d_i - \Delta)], \tag{5.70} \]

suppose that on the right side the minus sign attaches to the $(d_i - \Delta)$ with $i = i_1, \ldots, i_r$ and the plus sign to the remaining terms. Then (5.70) is equivalent to

\[ d_{i_1} + \cdots + d_{i_r} - r\Delta < 0, \quad \text{or} \quad \frac{d_{i_1} + \cdots + d_{i_r}}{r} < \Delta. \]

Thus, $\sum (d_i - \Delta)$ is among the $l$ smallest of the $\sum \pm (d_i - \Delta)$ if and only if at least $2^n - l$ of the $M = 2^n - 1$ averages $(d_{i_1} + \cdots + d_{i_r})/r$ are $< \Delta$, i.e. if and only if $\delta_{(K)} < \Delta$, where $\delta_{(1)} < \cdots < \delta_{(M)}$ is the ordered set of averages $(d_{i_1} + \cdots + d_{i_r})/r$, $r = 1, \ldots, n$. This establishes $\delta_{(K)}$ as a lower confidence bound for $\Delta$ at confidence level $\gamma = 1 - K/2^n = l/2^n$, the acceptance probability of the test of level $\alpha = K/2^n$. [Among all confidence sets that are unbiased in the model (5.46) with $m_i = n_i = 1$ and $c = n$, these bounds minimize the probability of falling below any value $\Delta' < \Delta$ for the normal model (5.52).]

By putting successively $K = 1, 2, \ldots, 2^n$, it is seen that the $M + 1$ intervals

\[ (-\infty, \delta_{(1)}),\ (\delta_{(1)}, \delta_{(2)}),\ \ldots,\ (\delta_{(M-1)}, \delta_{(M)}),\ (\delta_{(M)}, \infty) \tag{5.71} \]

each have probability $1/(M + 1) = 1/2^n$ of containing the unknown $\Delta$. The two-sided confidence intervals $(\delta_{(K)}, \delta_{(2^n - K)})$ with $\gamma = (2^{n-1} - K)/2^{n-1}$ correspond to the two-sided version of the test (5.53) with error probability $(1 - \gamma)/2$ in each tail. A suitable subset of the points $\delta_{(1)}, \ldots, \delta_{(M)}$ constitutes a set of confidence points in the sense of Section 3.5.
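The inversion can be verified directly for small $n$. The following Python sketch is illustrative only; the function names are hypothetical. It computes the ordered subset averages $\delta_{(1)} \le \cdots \le \delta_{(M)}$ and, separately, the acceptance decision of the sign-flip test of $H : \Delta = \Delta_0$, so that the two can be compared.

```python
from itertools import combinations, product


def subset_averages(d):
    """Sorted averages of all 2^n - 1 nonempty subsets of the differences d."""
    return sorted(sum(c) / r
                  for r in range(1, len(d) + 1)
                  for c in combinations(d, r))


def lower_confidence_bound(d, K):
    """delta_(K): the K-th smallest subset average (K counted from 1)."""
    return subset_averages(d)[K - 1]


def accepts(d, delta0, K):
    """Acceptance of H: Delta = delta0 by the one-sided sign-flip test.

    H is accepted when at least K of the 2^n sums of +/-(d_i - delta0)
    are strictly larger than the observed sum of (d_i - delta0).
    """
    e = [di - delta0 for di in d]
    observed = sum(e)
    larger = sum(1 for signs in product((1, -1), repeat=len(e))
                 if sum(s * ei for s, ei in zip(signs, e)) > observed)
    return larger >= K


# Hypothetical differences d_i = y_i - x_i for n = 3 pairs.
d = [3, 1, 4]
print(subset_averages(d))
print(lower_confidence_bound(d, 2), accepts(d, 2.2, 2), accepts(d, 1.5, 2))
```

For every $\Delta_0$ the call `accepts(d, delta0, K)` agrees with the comparison `lower_confidence_bound(d, K) < delta0`, which is the equivalence derived from (5.70).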

The inversion procedure for the two-group case is quite analogous. Let $(x_1, \ldots, x_m, y_1, \ldots, y_n)$ denote the $m$ control and $n$ treatment observations, and suppose without loss of generality that $m \le n$. Then the hypothesis $\Delta = \Delta_0$ is accepted against $\Delta > \Delta_0$ if $\sum_{j=1}^{n} (y_j - \Delta_0)$ is among the $l$ smallest of the $\binom{m+n}{n}$ sums obtained by replacing a subset of the $(y_j - \Delta_0)$'s with $x$'s. The inequality

\[ \sum (y_j - \Delta_0) < (x_{i_1} + \cdots + x_{i_r}) + [y_{j_1} + \cdots + y_{j_{n-r}} - (n - r)\Delta_0], \]

with $(i_1, \ldots, i_r, j_1, \ldots, j_{n-r})$ a permutation of $(1, \ldots, n)$, is equivalent to $y_{i_1} + \cdots + y_{i_r} - r\Delta_0 < x_{i_1} + \cdots + x_{i_r}$, or

\[ \bar{y}_{i_1, \ldots, i_r} - \bar{x}_{i_1, \ldots, i_r} < \Delta_0. \tag{5.72} \]

Note that the number of such averages with $r \ge 1$ (i.e. omitting the empty set of subscripts) is equal to

\[ \sum_{r=1}^{m} \binom{m}{r} \binom{n}{r} = \binom{m+n}{n} - 1 = M \]

(Problem 5.57). Thus, $H : \Delta = \Delta_0$ is accepted against $\Delta > \Delta_0$ at level $\alpha = 1 - l/(M + 1)$ if and only if at least $K = M + 1 - l$ of the $M$ differences (5.72) are less than $\Delta_0$, and hence if and only if $\delta_{(K)} < \Delta_0$, where $\delta_{(1)} < \cdots < \delta_{(M)}$ denote the ordered set of differences (5.72). This establishes $\delta_{(K)}$ as a lower confidence bound for $\Delta$ with confidence coefficient $\gamma = 1 - \alpha$.
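For small samples the differences (5.72) can be listed by brute force. The Python sketch below is an illustration under stated assumptions: the function name `mean_differences` and the data are hypothetical, and every pair consisting of an $r$-subset of the $x$'s and an $r$-subset of the $y$'s is taken to contribute one difference of means.

```python
from itertools import combinations
from math import comb


def mean_differences(x, y):
    """Sorted differences ybar - xbar over equal-size nonempty subsets."""
    out = []
    for r in range(1, min(len(x), len(y)) + 1):
        for xs in combinations(x, r):
            for ys in combinations(y, r):
                out.append(sum(ys) / r - sum(xs) / r)
    return sorted(out)


# Hypothetical data: m = 3 controls, n = 4 treated.
x, y = [1, 4, 6], [5, 8, 9, 12]
diffs = mean_differences(x, y)
M = comb(len(x) + len(y), len(y)) - 1
print(len(diffs), M)          # both counts should agree
K = 3
print(diffs[K - 1])           # delta_(K), the K-th smallest difference
```

The count of differences matches $M = \binom{m+n}{n} - 1$, and the $K$th smallest entry is the lower confidence bound $\delta_{(K)}$.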

As in the paired comparisons case, it is seen that the intervals (5.71) each have probability $1/(M + 1)$ of containing $\Delta$. Thus, two-sided confidence intervals and standard confidence points can be derived as before. For the generalization to stratified samples, see Problem 5.58.

Algorithms for computing the order statistics $\delta_{(1)}, \ldots, \delta_{(M)}$ in the paired-comparison and two-sample cases are discussed by Tritchler (1984); also see Garthwaite (1996). If $M$ is too large for the computations to be practicable, reduced analyses based on either a fixed or random subset of the set of all $M + 1$ permutations are discussed, for example, by Gabriel and Hall (1983) and Vadiveloo (1983). [See also Problem 5.60(i).] Different such methods are compared by Forsythe and Hartigan (1970). For some generalizations, and relations to other subsampling plans, see Efron (1982, Chapter 9).


5.13 Testing for Independence in a Bivariate Normal Distribution


So far, the methods of the present chapter have been illustrated mainly by the two-sample problem. As a further example, we shall now apply two of the formulations that have been discussed, the normal model of Section 5.3 and the nonparametric one of Section 5.8, to the hypothesis of independence in a bivariate distribution.

The probability density of a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a bivariate normal distribution is

\[ \frac{1}{(2\pi\sigma\tau\sqrt{1 - \rho^2})^n} \exp\left( -\frac{1}{2(1 - \rho^2)} \left( \frac{1}{\sigma^2} \sum (x_i - \xi)^2 - \frac{2\rho}{\sigma\tau} \sum (x_i - \xi)(y_i - \eta) + \frac{1}{\tau^2} \sum (y_i - \eta)^2 \right) \right). \tag{5.73} \]

Here $(\xi, \sigma^2)$ and $(\eta, \tau^2)$ are the mean and variance of $X$ and $Y$ respectively, and $\rho$ is the correlation coefficient between $X$ and $Y$. The hypotheses $\rho \le \rho_0$ and $\rho = \rho_0$ for arbitrary $\rho_0$ cannot be treated by the methods of the present chapter, and will be taken up in Chapter 6. For the present, we shall consider only the hypothesis $\rho = 0$ that $X$ and $Y$ are independent, and the corresponding one-sided hypothesis $\rho \le 0$.

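The family of densities (5.73) is of the exponential form (1) with

\[
U = \sum X_i Y_i, \quad T_1 = \sum X_i^2, \quad T_2 = \sum Y_i^2, \quad T_3 = \sum X_i, \quad T_4 = \sum Y_i
\]

and

\[
\theta = \frac{\rho}{\sigma\tau(1-\rho^2)}, \quad
\vartheta_1 = -\frac{1}{2\sigma^2(1-\rho^2)}, \quad
\vartheta_2 = -\frac{1}{2\tau^2(1-\rho^2)},
\]
\[
\vartheta_3 = \frac{1}{1-\rho^2}\left(\frac{\xi}{\sigma^2} - \frac{\eta\rho}{\sigma\tau}\right), \quad
\vartheta_4 = \frac{1}{1-\rho^2}\left(\frac{\eta}{\tau^2} - \frac{\xi\rho}{\sigma\tau}\right).
\]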

The hypothesis $H : \rho \le 0$ is equivalent to $\theta \le 0$. Since the sample correlation
coefficient

\[
R = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2 \sum (Y_i - \bar Y)^2}}
\]

is unchanged when the $X_i$ and $Y_i$ are replaced by $(X_i - \xi)/\sigma$ and $(Y_i - \eta)/\tau$,
the distribution of $R$ does not depend on $\xi$, $\eta$, $\sigma$, or $\tau$, but only on $\rho$. For $\theta = 0$
it therefore does not depend on $\vartheta_1, \ldots, \vartheta_4$, and hence by Theorem 5.1.2, $R$ is
independent of $(T_1, \ldots, T_4)$ when $\theta = 0$. It follows from Theorem 5.1.1 that the
UMP unbiased test of $H$ rejects when

\[
R \ge C_0, \tag{5.74}
\]

or equivalently when

\[
\frac{R}{\sqrt{(1-R^2)/(n-2)}} \ge K_0. \tag{5.75}
\]


The statistic $R$ is linear in $U$, and its distribution for $\rho = 0$ is symmetric about
0. The UMP unbiased test of the hypothesis $\rho = 0$ against the alternative $\rho \ne 0$
therefore rejects when

\[
\frac{|R|}{\sqrt{(1-R^2)/(n-2)}} \ge K_1. \tag{5.76}
\]


Since $\sqrt{n-2}\,R/\sqrt{1-R^2}$ has the $t$-distribution with $n - 2$ degrees of freedom
when $\rho = 0$ (Problem 5.64), the constants $K_0$ and $K_1$ in the above tests are
given by

\[
\int_{K_0}^{\infty} t_{n-2}(y)\,dy = \alpha \quad\text{and}\quad \int_{K_1}^{\infty} t_{n-2}(y)\,dy = \frac{\alpha}{2}. \tag{5.77}
\]

Since the distribution of $R$ depends only on the correlation coefficient $\rho$, the same
is true of the power of these tests.
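
As an illustration only, the tests (5.75) and (5.76) can be sketched numerically. The function names, the toy data, and the use of Simpson's rule for the tail area of the $t$-distribution are assumptions made for this sketch and are not part of the text; a statistical library would normally supply the tail probability.

```python
import math

def sample_correlation(x, y):
    """Sample correlation coefficient R of the paired observations."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Statistic of (5.75): R / sqrt((1 - R^2) / (n - 2))."""
    return r / math.sqrt((1.0 - r * r) / (n - 2))

def t_density(y, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + y * y / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = total * h / 3.0  # P(0 < T < |t|)
    return 0.5 - central if t >= 0 else 0.5 + central

# Hypothetical toy data, chosen only to exercise the functions.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = sample_correlation(x, y)
t = t_statistic(r, len(x))
p_one_sided = t_upper_tail(t, len(x) - 2)            # test of rho <= 0, cf. (5.75)
p_two_sided = 2 * t_upper_tail(abs(t), len(x) - 2)   # test of rho = 0, cf. (5.76)
```

Rejecting when `p_one_sided` is at most $\alpha$ corresponds to comparing the statistic with $K_0$ in (5.77); the two-sided $p$-value plays the same role for $K_1$.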

Some large sample properties of the above test will be examined in Problem
11.64. In particular, if $(X_i, Y_i)$ is not bivariate normal, the level of the above test
is approximately $\alpha$ in large samples under the hypothesis $H_1$ that $X_i$ and $Y_i$ are
independent, but not necessarily under the hypothesis $H_2$ that the correlation
between $X_i$ and $Y_i$ is 0. For the nonparametric model $H_1$, one can obtain an
exact level-$\alpha$ unbiased test of independence in analogy to the permutation test
of Section 5.8. For any bivariate distribution of $(X, Y)$, let $Y_x$ denote a random
variable whose distribution is the conditional distribution of $Y$ given $x$. We shall
say that there is positive regression dependence between $X$ and $Y$ if for any
$x < x'$ the variable $Y_{x'}$ is stochastically larger than $Y_x$. Generally speaking,
larger values of $Y$ will then correspond to larger values of $X$; this is the intuitive
meaning of positive dependence. An example is furnished by any normal bivariate
distribution with $\rho > 0$. (See Problem 5.68.) Regression dependence is a stronger
requirement than positive quadrant dependence, which was defined in Problem
4.28. However, both reflect the intuitive meaning that large (small) values of $Y$
will tend to correspond to large (small) values of $X$.

As alternatives to $H_1$ consider positive regression dependence in a general
bivariate distribution possessing a density. To see that unbiasedness implies
similarity, let $F_1$, $F_2$ be any two univariate distributions with densities $f_1$, $f_2$ and
consider the one-parameter family of distribution functions

\[
F_1(x)F_2(y)\{1 + \Delta[1 - F_1(x)][1 - F_2(y)]\}, \quad 0 \le \Delta \le 1. \tag{5.78}
\]

This is positively regression dependent (Problem 5.69), and by letting $\Delta \to 0$ one
sees that unbiasedness of $\phi$ against these distributions implies that the rejection
probability is $\alpha$ when $X$ and $Y$ are independent, and hence that

\[
\int \phi(x_1, \ldots, x_n; y_1, \ldots, y_n) f_1(x_1) \cdots f_1(x_n) f_2(y_1) \cdots f_2(y_n)\,dx\,dy = \alpha
\]

for all probability densities $f_1$ and $f_2$. By Theorem 5.8.1 this in turn implies

\[
\frac{1}{(n!)^2} \sum \phi(x_{i_1}, \ldots, x_{i_n}; y_{j_1}, \ldots, y_{j_n}) = \alpha.
\]

Here the summation extends over the $(n!)^2$ points of the set $S(x, y)$, which is
obtained from a fixed point $(x, y)$ with $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$ by
permuting the $x$-coordinates and the $y$-coordinates, each among themselves in
all possible ways.

Among all tests satisfying this condition, the most powerful one against the
normal alternatives (5.73) with $\rho > 0$ rejects for the $k'$ largest values of (5.73)
in each set $S(x, y)$, where $k'/(n!)^2 = \alpha$. Since $\sum x_i^2$, $\sum y_i^2$, $\sum x_i$, $\sum y_i$ are all
constant on $S(x, y)$, the test equivalently rejects for the $k'$ largest values of $\sum x_i y_i$
in each $S(x, y)$.

Of the $(n!)^2$ values that the statistic $\sum X_i Y_i$ takes on over $S(x, y)$, only $n!$ are
distinct, since the statistic remains unchanged if the $X$'s and $Y$'s are subjected
to the same permutation. A simpler form of the test is therefore obtained, for
example by rejecting $H_1$ for the $k$ largest values of $\sum x_{(i)} y_{j_i}$ of each set $S(x, y)$,
where $x_{(1)} < \cdots < x_{(n)}$ and $k/n! = \alpha$. The test can be shown to be unbiased
against all alternatives with positive regression dependence. (See Problem 6.62.)
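
The simplified permutation test can be sketched by brute-force enumeration, which is feasible only for small $n$. The function name and the treatment of ties (counting permutations whose statistic is at least as large as the observed one) are assumptions of this sketch rather than part of the text.

```python
from itertools import permutations

def permutation_pvalue(x, y):
    """Fraction of the n! pairings of y with x whose sum of products
    is at least as large as the observed sum of x_i * y_i."""
    observed = sum(a * b for a, b in zip(x, y))
    count = 0
    total = 0
    for perm in permutations(y):
        total += 1
        if sum(a * b for a, b in zip(x, perm)) >= observed:
            count += 1
    return count / total

# Perfectly concordant toy data: only the identity pairing attains the maximum.
p_concordant = permutation_pvalue([1, 2, 3, 4], [1, 2, 3, 4])
```

Rejecting when this fraction is at most $\alpha$ amounts to rejecting for the $k$ largest values of the statistic with $k/n! = \alpha$, provided $\alpha n!$ is an integer and there are no ties; otherwise the sketch is conservative and a randomized test would be needed for exact level $\alpha$.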

In order to obtain a comparison of the permutation test with the standard
normal test based on the sample correlation coefficient $R$, let $T(X, Y)$ denote the
set of ordered $X$'s and $Y$'s

\[
T(X, Y) = (X_{(1)}, \ldots, X_{(n)}; Y_{(1)}, \ldots, Y_{(n)}).
\]

The rejection region of the permutation test can then be written as

\[
\sum X_i Y_i > C[T(X, Y)],
\]

or equivalently as $R > K[T(X, Y)]$. It again turns out that the difference between
$K[T(X, Y)]$ and the cutoff point $C_0$ of the corresponding normal test (5.74) tends
to zero in an appropriate sense. Such results are developed in Section 15.2; also
see Problem 15.13. For large $n$, the standard normal test (5.74) therefore serves
as an approximation for the permutation test.


5.14 Problems 

Section 5.2 

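Problem 5.1 Let $X_1, \ldots, X_n$ be a sample from $N(\xi, \sigma^2)$. The power of Student's
$t$-test is an increasing function of $\xi/\sigma$ in the one-sided case $H : \xi \le 0$, $K : \xi > 0$,
and of $|\xi|/\sigma$ in the two-sided case $H : \xi = 0$, $K : \xi \ne 0$.

[If

\[
S = \sqrt{\frac{1}{n-1}\sum (X_i - \bar X)^2},
\]

the power in the two-sided case is given by

\[
1 - P\left\{-\frac{CS}{\sigma} - \frac{\sqrt{n}\,\xi}{\sigma} < \frac{\sqrt{n}(\bar X - \xi)}{\sigma} < \frac{CS}{\sigma} - \frac{\sqrt{n}\,\xi}{\sigma}\right\}
\]

and the result follows from the fact that it holds conditionally for each fixed value
of $S/\sigma$.]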


Problem 5.2 In the situation of the previous problem there exists no test for
testing $H : \xi = 0$ at level $\alpha$, which for all $\sigma$ has power $\ge \beta > \alpha$ against the
alternatives $(\xi, \sigma)$ with $\xi = \xi_1 > 0$.

[Let $\beta(\xi_1, \sigma)$ be the power of any level $\alpha$ test of $H$, and let $\beta(\sigma)$ denote the
power of the most powerful test for testing $\xi = 0$ against $\xi = \xi_1$ when $\sigma$ is known.
Then $\inf_\sigma \beta(\xi_1, \sigma) \le \inf_\sigma \beta(\sigma) = \alpha$.]


Problem 5.3 (i) Let $Z$ and $V$ be independently distributed as $N(\delta, 1)$ and
$\chi^2$ with $f$ degrees of freedom respectively. Then the ratio $Z \div \sqrt{V/f}$ has
the noncentral $t$-distribution with $f$ degrees of freedom and noncentrality
parameter $\delta$, the probability density of which is¹¹

\[
p_\delta(t) = \frac{1}{2^{\frac12(f+1)}\,\Gamma(\frac12 f)\sqrt{\pi f}}
\int_0^\infty y^{\frac12(f-1)} \exp\left(-\tfrac12 y\right)
\exp\left[-\tfrac12\left(t\sqrt{y/f} - \delta\right)^2\right] dy \tag{5.79}
\]

or equivalently

\[
p_\delta(t) = \frac{1}{2^{\frac12(f-1)}\,\Gamma(\frac12 f)\sqrt{\pi f}}
\exp\left(-\frac12\,\frac{f\delta^2}{f+t^2}\right)
\left(\frac{f}{f+t^2}\right)^{\frac12(f+1)}
\int_0^\infty v^f \exp\left[-\frac12\left(v - \frac{\delta t}{\sqrt{f+t^2}}\right)^2\right] dv.
\]

Another form is obtained by making the substitution $w = t\sqrt{y}/\sqrt{f}$ in
(5.79).

(ii) If $X_1, \ldots, X_n$ are independently distributed as $N(\xi, \sigma^2)$, then
$\sqrt{n}\,\bar X \div \sqrt{\sum (X_i - \bar X)^2/(n-1)}$ has the noncentral $t$-distribution with $n - 1$
degrees of freedom and noncentrality parameter $\delta = \sqrt{n}\,\xi/\sigma$. In the case
$\delta = 0$, show that the $t$-distribution with $n - 1$ degrees of freedom is given by
(5.18).

[(i): The first expression is obtained from the joint density of $Z$ and $V$ by
transforming to $t = z \div \sqrt{v/f}$ and $v$.]
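
As a rough numerical check of the first integral form, with normalizing constant $2^{(f+1)/2}\Gamma(f/2)\sqrt{\pi f}$, the sketch below evaluates the integral by Simpson's rule. The function name, the truncation point of the integral, and the number of steps are ad hoc assumptions; the rule is accurate for the central case $\delta = 0$ used in the checks, and may be less so when $\delta \ne 0$ and $f$ is small, because the integrand then has an unbounded derivative at the origin.

```python
import math

def noncentral_t_density(t, f, delta, steps=20000):
    """Evaluate the integral form of the noncentral t density numerically.
    The upper limit is a heuristic truncation point (an assumption)."""
    upper = f + 40.0 * math.sqrt(f) + 100.0
    h = upper / steps

    def integrand(y):
        return (y ** ((f - 1) / 2.0)
                * math.exp(-0.5 * y)
                * math.exp(-0.5 * (t * math.sqrt(y / f) - delta) ** 2))

    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    integral = total * h / 3.0
    const = 2.0 ** ((f + 1) / 2.0) * math.gamma(f / 2.0) * math.sqrt(math.pi * f)
    return integral / const

# With delta = 0 and f = 1 the density should reduce to the Cauchy density.
cauchy_at_zero = noncentral_t_density(0.0, 1, 0.0)
```

For $\delta = 0$ and $f = 1$ the integral can be done in closed form and gives $1/[\pi(1+t^2)]$, which is a convenient check on the constant.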


Problem 5.4 Let $X_1, \ldots, X_n$ be a sample from $N(\xi, \sigma^2)$. Denote the power of
the one-sided $t$-test of $H : \xi \le 0$ against the alternative $\xi/\sigma$ by $\beta(\xi/\sigma)$, and by
$\beta^*(\xi/\sigma)$ the power of the test appropriate when $\sigma$ is known. Determine $\beta(\xi/\sigma)$
for $n = 5, 10, 15$, $\alpha = .05$, $\xi/\sigma = 0.7, 0.8, 0.9, 1.0, 1.1, 1.2$, and in each case
compare it with $\beta^*(\xi/\sigma)$. Do the same for the two-sided case.
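
For the known-$\sigma$ benchmark only, the one-sided power has the closed form $1 - \Phi(z_{1-\alpha} - \sqrt{n}\,\xi/\sigma)$, which follows because $\sqrt{n}\bar X/\sigma$ is $N(\sqrt{n}\,\xi/\sigma, 1)$. A minimal sketch with the standard library is given below; the function name is an assumption, and the $t$-test power $\beta$ (which requires the noncentral $t$-distribution) is deliberately not computed here.

```python
from statistics import NormalDist

def power_known_sigma(n, ratio, alpha=0.05):
    """One-sided power when sigma is known:
    1 - Phi(z_{1-alpha} - sqrt(n) * xi/sigma), with ratio = xi/sigma."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha)
    return 1.0 - std_normal.cdf(z - (n ** 0.5) * ratio)

# Tabulate the benchmark for a few sample sizes and values of xi/sigma.
for n in (5, 10, 15):
    row = [round(power_known_sigma(n, r), 3) for r in (0.7, 0.8, 0.9, 1.0, 1.1, 1.2)]
    print(n, row)
```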


Problem 5.5 Let $Z_1, \ldots, Z_n$ be independently normally distributed with
common variance $\sigma^2$ and means $E(Z_i) = \zeta_i$ $(i = 1, \ldots, s)$, $E(Z_i) = 0$ $(i = s+1, \ldots, n)$.
There exist UMP unbiased tests for testing $\zeta_1 \le \zeta_1^0$ and $\zeta_1 = \zeta_1^0$ given by the
rejection regions

\[
\frac{Z_1 - \zeta_1^0}{\sqrt{\sum_{i=s+1}^n Z_i^2/(n-s)}} > C_0
\quad\text{and}\quad
\frac{|Z_1 - \zeta_1^0|}{\sqrt{\sum_{i=s+1}^n Z_i^2/(n-s)}} > C.
\]

When $\zeta_1 = \zeta_1^0$, the test statistic has the $t$-distribution with $n - s$ degrees of
freedom.


¹¹ A systematic account of this distribution can be found in Owen (1985) and
Johnson, Kotz and Balakrishnan (1995).


Problem 5.6 Let $X_1, \ldots, X_n$ be independently normally distributed with
common variance $\sigma^2$ and means $\xi_1, \ldots, \xi_n$, and let $Z_i = \sum_{j=1}^n a_{ij} X_j$ be an orthogonal
transformation (that is, $\sum_{i=1}^n a_{ij} a_{ik} = 1$ or 0 as $j = k$ or $j \ne k$). The $Z$'s are
normally distributed with common variance $\sigma^2$ and means $\zeta_i = \sum a_{ij}\xi_j$.

[The density of the $Z$'s is obtained from that of the $X$'s by substituting $x_i =
\sum b_{ij} z_j$, where $(b_{ij})$ is the inverse of the matrix $(a_{ij})$, and multiplying by the
Jacobian, which is 1.]

Problem 5.7 If $X_1, \ldots, X_n$ is a sample from $N(\xi, \sigma^2)$, the UMP unbiased tests
of $\xi \le 0$ and $\xi = 0$ can be obtained from Problems 5.5 and 5.6 by making an
orthogonal transformation to variables $Z_1, \ldots, Z_n$ such that $Z_1 = \sqrt{n}\,\bar X$.

[Then

\[
\sum_{i=2}^n Z_i^2 = \sum_{i=1}^n Z_i^2 - Z_1^2 = \sum_{i=1}^n X_i^2 - n\bar X^2 = \sum_{i=1}^n (X_i - \bar X)^2.]
\]
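
One concrete orthogonal transformation with $Z_1 = \sqrt{n}\,\bar X$ is given by a Helmert-type matrix. The sketch below builds such a matrix and checks the sum-of-squares identity numerically; the choice of this particular matrix, the function names, and the toy data are assumptions for illustration, since any orthogonal matrix with first row $(1/\sqrt{n}, \ldots, 1/\sqrt{n})$ would serve.

```python
import math

def helmert_matrix(n):
    """Rows of an orthogonal n x n matrix whose first row is (1/sqrt(n), ..., 1/sqrt(n))."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for k in range(1, n):
        scale = math.sqrt(k * (k + 1))
        # k equal entries, then -k, then zeros: orthogonal to all earlier rows.
        rows.append([1.0 / scale] * k + [-k / scale] + [0.0] * (n - k - 1))
    return rows

def transform(matrix, x):
    """Apply the linear transformation z = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

x = [2.0, -1.0, 0.5, 3.0, 1.5]          # hypothetical observations
n = len(x)
A = helmert_matrix(n)
z = transform(A, x)
xbar = sum(x) / n
lhs = sum(v * v for v in z[1:])          # sum of Z_i^2 for i >= 2
rhs = sum((v - xbar) ** 2 for v in x)    # sum of (X_i - Xbar)^2
```

Up to sign, the later rows of this matrix are the contrasts $[kX_{k+1} - (X_1 + \cdots + X_k)]/\sqrt{k(k+1)}$ that appear in Problem 5.8.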

Problem 5.8 Let $X_1, X_2, \ldots$ be a sequence of independent variables distributed
as $N(\xi, \sigma^2)$, and let $Y_n = [nX_{n+1} - (X_1 + \cdots + X_n)]/\sqrt{n(n+1)}$. Then the
variables $Y_1, Y_2, \ldots$ are independently distributed as $N(0, \sigma^2)$.

Problem 5.9 Let $N$ have the binomial distribution based on 10 trials with
success probability $p$. Given $N = n$, let $X_1, \ldots, X_n$ be i.i.d. normal with mean $\theta$ and
variance one. The data consists of $(N, X_1, \ldots, X_N)$.

(i) If $p$ has a known value $p_0$, show there does not exist a UMP test of $\theta = 0$
versus $\theta > 0$. [In fact, a UMPU test does not exist either.]

(ii) If $p$ is unknown (taking values in $(0,1)$), find a UMPU test of $\theta = 0$ versus
$\theta > 0$.

Problem 5.10 As in Example 3.9.2, suppose $X$ is multivariate normal with
unknown mean $\xi = (\xi_1, \ldots, \xi_k)^T$ and known positive definite covariance matrix
$\Sigma$. Assume $a = (a_1, \ldots, a_k)^T$ is a fixed vector. The problem is to test

\[
H : \sum_{i=1}^k a_i \xi_i = \delta \quad\text{vs.}\quad K : \sum_{i=1}^k a_i \xi_i \ne \delta.
\]

Find a UMPU level $\alpha$ test. Hint: First consider $\Sigma = I_k$, the identity matrix.

Problem 5.11 Let $X_i = \xi + U_i$, and suppose that the joint density $f$ of the $U$'s
is spherically symmetric, that is, a function of $\sum U_i^2$ only,

\[
f(u_1, \ldots, u_n) = q\left(\sum u_i^2\right).
\]

Show that the null distribution of the one-sample $t$-statistic is independent of
$q$ and hence is the same as in the normal case, namely Student's $t$ with $n - 1$
degrees of freedom. Hint: Write $t_n$ as

\[
\frac{n^{1/2}\bar X_n/\sqrt{\sum X_j^2}}{\sqrt{\sum (X_i - \bar X_n)^2/[(n-1)\sum X_j^2]}},
\]

and use the fact that when $\xi = 0$, the density of $X_1, \ldots, X_n$ is constant over
the spheres $\sum x_j^2 = c$ and hence the conditional distribution of the variables
$X_i/\sqrt{\sum X_j^2}$ given $\sum X_j^2 = c$ is uniform over the conditioning sphere and hence
independent of $q$. Note. This model represents one departure from the
normal-theory assumption which does not affect the level of the test. The effect of a
much weaker symmetry condition more likely to arise in practice is investigated
by Efron (1969).


Section 5.3 

Problem 5.12 Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ be independent samples from
$N(\xi, \sigma^2)$ and $N(\eta, \tau^2)$ respectively. Determine the sample size necessary to obtain
power $\ge \beta$ against the alternatives $\tau/\sigma > \Delta$ when $\alpha = .05$, $\beta = .9$, $\Delta = 1.5, 2, 3$,
and the hypothesis being tested is $H : \tau/\sigma \le 1$.


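Problem 5.13 If $m = n$, the acceptance region (5.23) can be written as

\[
\max\left(\frac{S_Y^2}{\Delta_0 S_X^2}, \frac{\Delta_0 S_X^2}{S_Y^2}\right) \le \frac{1 - C}{C},
\]

where $S_X^2 = \sum (X_i - \bar X)^2$, $S_Y^2 = \sum (Y_i - \bar Y)^2$ and where $C$ is determined by

\[
\int_0^C B_{n-1,n-1}(w)\,dw = \frac{\alpha}{2}.
\]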


Problem 5.14 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be samples from $N(\xi, \sigma^2)$
and $N(\eta, \sigma^2)$. The UMP unbiased test for testing $\eta - \xi = 0$ can be
obtained through Problems 5.5 and 5.6 by making an orthogonal
transformation from $(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ to $(Z_1, \ldots, Z_{m+n})$ such that
$Z_1 = (\bar Y - \bar X)/\sqrt{(1/m) + (1/n)}$, $Z_2 = (\sum X_i + \sum Y_i)/\sqrt{m+n}$.


Problem 5.15 Exponential densities. Let $X_1, \ldots, X_n$ be a sample from a
distribution with exponential density $a^{-1}e^{-(x-b)/a}$ for $x \ge b$.

(i) For testing $a = 1$ there exists a UMP unbiased test given by the acceptance
region

\[
C_1 \le 2\sum [x_i - \min(x_1, \ldots, x_n)] \le C_2,
\]

where the test statistic has a $\chi^2$-distribution with $2n - 2$ degrees of freedom
when $a = 1$, and $C_1$, $C_2$ are determined by

\[
\int_{C_1}^{C_2} \chi^2_{2n-2}(y)\,dy = \int_{C_1}^{C_2} \chi^2_{2n}(y)\,dy = 1 - \alpha.
\]

(ii) For testing $b = 0$ there exists a UMP unbiased test given by the acceptance
region

\[
0 \le \frac{n\min(x_1, \ldots, x_n)}{\sum [x_i - \min(x_1, \ldots, x_n)]} \le C.
\]

When $b = 0$, the test statistic has probability density

\[
p(u) = \frac{n-1}{(1+u)^n}, \quad u \ge 0.
\]

[These distributions for varying $b$ do not constitute an exponential family, and
Theorem 4.4.1 is therefore not directly applicable. For (i), one can restrict
attention to the ordered variables $X_{(1)} < \cdots < X_{(n)}$, since these are sufficient for $a$ and
$b$, and transform to new variables $Z_1 = nX_{(1)}$, $Z_i = (n - i + 1)[X_{(i)} - X_{(i-1)}]$ for
$i = 2, \ldots, n$, as in Problem 2.15. When $a = 1$, $Z_1$ is a complete sufficient statistic
for $b$, and the test is therefore obtained by considering the conditional problem
given $z_1$. Since $\sum_{i=2}^n Z_i$ is independent of $Z_1$, the conditional UMP unbiased test
has the acceptance region $C_1 \le 2\sum_{i=2}^n Z_i \le C_2$ for each $z_1$, and the result follows.

For (ii), when $b = 0$, $\sum_{i=1}^n Z_i$ is a complete sufficient statistic for $a$, and the
test is therefore obtained by considering the conditional problem given $\sum_{i=1}^n z_i$.
The remainder of the argument uses the fact that $Z_1/\sum_{i=2}^n Z_i$ is independent
of $\sum_{i=1}^n Z_i$ when $b = 0$, and otherwise is similar to that used to prove
Theorem 5.1.1.]
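
A quick Monte Carlo sanity check of part (i) is sketched below: when $a = 1$, the statistic $2\sum [x_i - \min(x_1, \ldots, x_n)]$ should have mean $2n - 2$ whatever the value of $b$. The function names, the seed, and the number of replications are arbitrary assumptions, and a simulated mean only checks the first moment, not the full $\chi^2$ distribution.

```python
import random

def scale_statistic(sample):
    """The statistic 2 * sum(x_i - min(x)) used in the acceptance region of part (i)."""
    m = min(sample)
    return 2.0 * sum(v - m for v in sample)

def simulate_mean(n, b, reps, seed=0):
    """Average of the statistic over simulated samples with a = 1 and location b."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [b + rng.expovariate(1.0) for _ in range(n)]
        total += scale_statistic(sample)
    return total / reps

estimated_mean = simulate_mean(n=5, b=3.0, reps=20000)   # should be close to 2n - 2 = 8
```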


Problem 5.16 Let $X_1, \ldots, X_n$ be a sample from the Pareto distribution $P(c, \tau)$,
both parameters unknown. Obtain UMP unbiased tests for the parameters $c$
and $\tau$. [Problems 5.15 and 3.8.]


Problem 5.17 Extend the results of the preceding problem to the case,
considered in Problem 3.29, that observation is continued only until $X_{(1)}, \ldots, X_{(r)}$ have
been observed.


Problem 5.18 Gamma two-sample problem. Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be
independent samples from gamma distributions $\Gamma(g_1, b_1)$, $\Gamma(g_2, b_2)$ respectively.

(i) If $g_1$, $g_2$ are known, there exists a UMP unbiased test of $H : b_2 = b_1$ against
one- and two-sided alternatives, which can be based on a beta distribution.
[Some applications and generalizations are discussed in Lentner and
Buehler (1963).]

(ii) If $g_1$, $g_2$ are unknown, show that a UMP unbiased test of $H$ continues to
exist, and describe its general form.

(iii) If $b_2 = b_1 = b$ (unknown), there exists a UMP unbiased test of $g_2 = g_1$
against one- and two-sided alternatives; describe its general form.

[(i): If $Y_i$ $(i = 1, 2)$ are independent $\Gamma(g_i, b)$, then $Y_1 + Y_2$ is $\Gamma(g_1 + g_2, b)$ and
$Y_1/(Y_1 + Y_2)$ has a beta distribution.]

Problem 5.19 Inverse Gaussian distribution.¹² Let $X_1, \ldots, X_n$ be a sample
from the inverse Gaussian distribution $I(\mu, \tau)$, both parameters unknown.

(i) There exists a UMP unbiased test of $\mu \le \mu_0$ against $\mu > \mu_0$, which rejects
when $\bar X > C[\sum (X_i + 1/X_i)]$, and a corresponding UMP unbiased test of
$\mu = \mu_0$ against $\mu \ne \mu_0$.

[The conditional distribution needed to carry out this test is given by
Chhikara and Folks (1976).]

(ii) There exist UMP unbiased tests of $H : \tau = \tau_0$ against both one- and
two-sided hypotheses based on the statistic $V = \sum (1/X_i - 1/\bar X)$.

(iii) When $\tau = \tau_0$, the distribution of $\tau_0 V$ is $\chi^2_{n-1}$.
[Tweedie (1957).]

Problem 5.20 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent samples from
$I(\mu, \sigma)$ and $I(\nu, \tau)$ respectively.

(i) There exist UMP unbiased tests of $\tau/\sigma$ against one- and two-sided
alternatives.

(ii) If $\tau = \sigma$, there exist UMP unbiased tests of $\nu/\mu$ against one- and two-sided
alternatives.

[Chhikara (1975).]


Problem 5.21 Suppose $X$ and $Y$ are independent, normally distributed with
variance 1, and means $\xi$ and $\eta$, respectively. Consider testing the simple null
hypothesis $\xi = \eta = 0$ against the composite alternative hypothesis $\xi > 0$, $\eta > 0$.
Show that a UMPU test does not exist.


Section 5.4

Problem 5.22 On the basis of a sample $X = (X_1, \ldots, X_n)$ of fixed size from
$N(\xi, \sigma^2)$ there do not exist confidence intervals for $\xi$ with positive confidence
coefficient and of bounded length.¹³

[Consider any family of confidence intervals $\delta(X) \pm L/2$ of constant length $L$.
Let $\xi_1, \ldots, \xi_{2N}$ be such that $|\xi_i - \xi_j| > L$ whenever $i \ne j$. Then the sets
$S_i = \{x : |\delta(x) - \xi_i| \le L/2\}$ $(i = 1, \ldots, 2N)$ are mutually exclusive. Also, there exists
$\sigma_0 > 0$ such that

\[
\left|P_{\xi_i,\sigma}\{X \in S_i\} - P_{\xi_1,\sigma}\{X \in S_i\}\right| \le \frac{1}{2N} \quad\text{for } \sigma > \sigma_0,
\]

as is seen by transforming to new variables $Y_j = (X_j - \xi_1)/\sigma$ and applying
Lemmas 5.5.1 and 5.11.1 of the Appendix. Since $\min_i P_{\xi_1,\sigma}\{X \in S_i\} \le 1/(2N)$,
it follows for $\sigma > \sigma_0$ that $\min_i P_{\xi_i,\sigma}\{X \in S_i\} \le 1/N$, and hence that

\[
\inf_{\xi,\sigma} P_{\xi,\sigma}\left\{|\delta(X) - \xi| \le \frac{L}{2}\right\} \le \frac{1}{N}.
\]

The confidence coefficient associated with the intervals $\delta(X) \pm L/2$ is therefore
zero, and the same must be true a fortiori of any set of confidence intervals of
length $\le L$.]


¹² For additional information concerning inference in inverse Gaussian distributions,
see Folks and Chhikara (1978) and Johnson, Kotz and Balakrishnan (1994, volume 1).

¹³ A similar conclusion holds in the problem of constructing a confidence interval for the
ratio of normal means (Fieller's problem), as discussed in Koschat (1987). For problems
where it is impossible to construct confidence intervals with finite expected length, see
Gleser and Hwang (1987).


Problem 5.23 Stein's two-stage procedure.

(i) If $mS^2/\sigma^2$ has a $\chi^2$-distribution with $m$ degrees of freedom, and if
the conditional distribution of $Y$ given $S = s$ is $N(0, \sigma^2/S^2)$, then $Y$ has
Student's $t$-distribution with $m$ degrees of freedom.

(ii) Let $X_1, X_2, \ldots$ be independently distributed as $N(\xi, \sigma^2)$. Let $\bar X_0 =
\sum_{i=1}^{n_0} X_i/n_0$, $S^2 = \sum_{i=1}^{n_0} (X_i - \bar X_0)^2/(n_0 - 1)$, and let $a_1 = \cdots = a_{n_0} = a$,
$a_{n_0+1} = \cdots = a_n = b$ and $n \ge n_0$ be measurable functions of $S$. Then

\[
Y = \frac{\sum_{i=1}^n a_i (X_i - \xi)}{\sqrt{S^2 \sum_{i=1}^n a_i^2}}
\]

has Student's distribution with $n_0 - 1$ degrees of freedom.

(iii) Consider a two-stage sampling scheme $\Pi_1$, in which $S^2$ is computed from
an initial sample of size $n_0$, and then $n - n_0$ additional observations are
taken. The size of the second sample is such that

\[
n = \max\left\{n_0 + 1, \left[\frac{S^2}{c}\right] + 1\right\},
\]

where $c$ is any given constant and where $[y]$ denotes the largest integer
$\le y$. There then exist numbers $a_1, \ldots, a_n$ such that $a_1 = \cdots = a_{n_0}$, $a_{n_0+1} =
\cdots = a_n$, $\sum_{i=1}^n a_i = 1$, $\sum_{i=1}^n a_i^2 = c/S^2$. It follows from (ii) that
$\sum_{i=1}^n a_i (X_i - \xi)/\sqrt{c}$ has Student's $t$-distribution with $n_0 - 1$ degrees of freedom.

(iv) The following sampling scheme $\Pi_2$, which does not require that the second
sample contain at least one observation, is slightly more efficient than $\Pi_1$
for the applications to be made in Problems 5.24 and 5.25. Let $n_0$, $S^2$, and
$c$ be defined as before; let

\[
n = \max\left\{n_0, \left[\frac{S^2}{c}\right] + 1\right\},
\]

$a_i = 1/n$ $(i = 1, \ldots, n)$, and $\bar X = \sum_{i=1}^n a_i X_i$. Then $\sqrt{n}(\bar X - \xi)/S$ has again
the $t$-distribution with $n_0 - 1$ degrees of freedom.

[(ii): Given $S = s$, the quantities $a$, $b$, and $n$ are constants, $\sum_{i=1}^{n_0} a_i (X_i - \xi) =
n_0 a(\bar X_0 - \xi)$ is distributed as $N(0, n_0 a^2 \sigma^2)$, and the numerator of $Y$ is therefore
normally distributed with zero mean and variance $\sigma^2 \sum_{i=1}^n a_i^2$. The result now
follows from (i).]
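
The existence claim in part (iii) can be made concrete: given $n_0$, $S^2$, and $c$, the two constraints $\sum a_i = 1$ and $\sum a_i^2 = c/S^2$ reduce to a quadratic equation in the common first-stage weight $a$. The sketch below solves it; the function name, the choice of the smaller root, and the numerical inputs are assumptions made for illustration.

```python
import math

def stein_sample_size_and_weights(n0, s2, c):
    """Total sample size n of part (iii) and weights (a, b) with
    n0*a + (n - n0)*b = 1 and n0*a^2 + (n - n0)*b^2 = c / s2."""
    n = max(n0 + 1, math.floor(s2 / c) + 1)
    m = n - n0                       # size of the second sample (at least 1)
    q = c / s2                       # target value of the sum of squared weights
    # Substituting b = (1 - n0*a)/m gives n0*n*a^2 - 2*n0*a + (1 - q*m) = 0.
    disc = max(n0 * m * (n * q - 1.0), 0.0)   # nonnegative because n > s2/c
    a = (n0 - math.sqrt(disc)) / (n0 * n)     # either root works; the smaller one is used
    b = (1.0 - n0 * a) / m
    return n, a, b

n, a, b = stein_sample_size_and_weights(n0=5, s2=4.0, c=0.5)
```

The discriminant is nonnegative precisely because $n > S^2/c$, i.e. $c/S^2 > 1/n$, and $1/n$ is the smallest value $\sum a_i^2$ can take subject to $\sum a_i = 1$.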


Problem 5.24 Confidence intervals of fixed length for a normal mean.

(i) In the two-stage procedure $\Pi_1$ defined in part (iii) of the preceding
problem, let the number $c$ be determined for any given $L > 0$ and $0 < \gamma < 1$
by

\[
\int_{-L/2\sqrt{c}}^{L/2\sqrt{c}} t_{n_0-1}(y)\,dy = \gamma,
\]

where $t_{n_0-1}$ denotes the density of the $t$-distribution with $n_0 - 1$ degrees of
freedom. Then the intervals $\sum_{i=1}^n a_i X_i \pm L/2$ are confidence intervals for
$\xi$ of length $L$ and with confidence coefficient $\gamma$.

(ii) Let $c$ be defined as in (i), and let the sampling procedure be $\Pi_2$ as defined
in part (iv) of Problem 5.23. The intervals $\bar X \pm L/2$ are then confidence
intervals of length $L$ for $\xi$ with confidence coefficient $\ge \gamma$, while the expected
number of observations required is slightly lower than under $\Pi_1$.

[(i): The probability that the intervals cover $\xi$ equals

\[
P\left\{-\frac{L}{2\sqrt{c}} \le \frac{\sum_{i=1}^n a_i (X_i - \xi)}{\sqrt{c}} \le \frac{L}{2\sqrt{c}}\right\} = \gamma.
\]

(ii): The probability that the intervals cover $\xi$ equals

\[
P\left\{\frac{\sqrt{n}\,|\bar X - \xi|}{S} \le \frac{\sqrt{n}\,L}{2S}\right\}
\ge P\left\{\frac{\sqrt{n}\,|\bar X - \xi|}{S} \le \frac{L}{2\sqrt{c}}\right\} = \gamma.]
\]


Problem 5.25 Two-stage $t$-tests with power independent of $\sigma$.

(i) For the procedure $\Pi_1$ with any given $c$, let $C$ be defined by

\[
\int_C^\infty t_{n_0-1}(y)\,dy = \alpha.
\]

Then the rejection region $(\sum_{i=1}^n a_i X_i - \xi_0)/\sqrt{c} > C$ defines a level-$\alpha$ test
of $H : \xi \le \xi_0$ with strictly increasing power function $\beta_c(\xi)$ depending only
on $\xi$.

(ii) Given any alternative $\xi_1$ and any $\alpha < \beta < 1$, the number $c$ can be chosen
so that $\beta_c(\xi_1) = \beta$.

(iii) The test with rejection region $\sqrt{n}(\bar X - \xi_0)/S > C$ based on $\Pi_2$ and the
same $c$ as in (i) is a level-$\alpha$ test of $H$ which is uniformly more powerful
than the test given in (i).

(iv) Extend parts (i)-(iii) to the problem of testing $\xi = \xi_0$ against $\xi \ne \xi_0$.

[(i) and (ii): The power of the test is

\[
\beta_c(\xi) = \int_{C - (\xi - \xi_0)/\sqrt{c}}^\infty t_{n_0-1}(y)\,dy.
\]

(iii): This follows from the inequality $\sqrt{n}\,|\xi - \xi_0|/S \ge |\xi - \xi_0|/\sqrt{c}$.]


Problem 5.26 Let $S(x)$ be a family of confidence sets for a real-valued
parameter $\theta$, and let $\mu[S(x)]$ denote its Lebesgue measure. Then for every fixed
distribution $Q$ of $X$ (and hence in particular for $Q = P_{\theta_0}$ where $\theta_0$ is the true
value of $\theta$)

\[
E_Q\{\mu[S(X)]\} = \int_{\theta \ne \theta_0} Q\{\theta \in S(X)\}\,d\theta
\]

provided the necessary measurability conditions hold.

[The identity is known as the Ghosh-Pratt identity; see Ghosh (1961) and Pratt
(1961a). To prove it, write the expectation on the left side as a double integral,
apply Fubini's theorem, and note that the integral on the right side is unchanged
if the point $\theta = \theta_0$ is added to the region of integration.]

Problem 5.27 Use the preceding problem to show that uniformly most accurate
confidence sets also uniformly minimize the expected Lebesgue measure (length
in the case of intervals) of the confidence sets.¹⁴


Section 5.5 

Problem 5.28 Let $X_1, \ldots, X_n$ be distributed as in Problem 5.15. Then the most
accurate unbiased confidence intervals for the scale parameter $a$ are

\[
\frac{2}{C_2}\sum [x_i - \min(x_1, \ldots, x_n)] \le a \le \frac{2}{C_1}\sum [x_i - \min(x_1, \ldots, x_n)].
\]

Problem 5.29 Most accurate unbiased confidence intervals exist in the
following situations:

(i) If $X$, $Y$ are independent with binomial distributions $b(p_1, m)$ and $b(p_2, m)$,
for the parameter $p_1 q_2/p_2 q_1$.

(ii) In a $2 \times 2$ table, for the parameter $\Delta$ of Section 4.6.

Problem 5.30 Shape parameter of a gamma distribution. Let $X_1, \ldots, X_n$ be a
sample from the gamma distribution $\Gamma(g, b)$ defined in Problem 3.34.

(i) There exist UMP unbiased tests of $H : g \le g_0$ against $g > g_0$ and of
$H' : g = g_0$ against $g \ne g_0$, and their rejection regions are based on
$W = \prod (X_i/\bar X)$.

(ii) There exist uniformly most accurate confidence intervals for $g$ based on $W$.

[Shorack (1972).]

Notes.

(1) The null distribution of $W$ is discussed in Bain and Engelhardt (1975),
Glaser (1976), and Engelhardt and Bain (1978).

(2) For $g = 1$, $\Gamma(g, b)$ reduces to an exponential distribution, and (i) becomes
the UMP unbiased test for testing that a distribution is exponential against
the alternative that it is gamma with $g > 1$ or with $g \ne 1$.

(3) An alternative treatment of this and some of the following problems is
given by Bar-Lev and Reiser (1982).


¹⁴ For the corresponding result concerning one-sided confidence bounds, see Madansky
(1962).

Problem 5.31 Scale parameter of a gamma distribution. Under the assumptions
of the preceding problem, there exists

(i) A UMP unbiased test of $H : b \le b_0$ against $b > b_0$ which rejects when
$\sum X_i > C(\prod X_i)$.

(ii) Most accurate unbiased confidence intervals for $b$.

[The conditional distribution of $\sum X_i$ given $\prod X_i$, which is required for carrying
out this test, is discussed by Engelhardt and Bain (1977).]

Problem 5.32 In Example 5.5.1, consider a confidence interval for $\sigma^2$ of the
form $I = [d_n^{-1} S_n^2, c_n^{-1} S_n^2]$, where $S_n^2 = \sum_i (X_i - \bar X)^2$ and $c_n < d_n$ are constants.
Subject to the level constraint, choose $c_n$ and $d_n$ to minimize the length of $I$.
Argue that the solution has shorter length than the uniformly most accurate
one; however, it is biased and so does not uniformly improve the probability
of covering false values. [The solution, given in Tate and Klett (1959), satisfies
$\chi^2_{n+3}(c_n) = \chi^2_{n+3}(d_n)$ and $\int_{c_n}^{d_n} \chi^2_{n-1}(y)\,dy = 1 - \alpha$, where $\chi^2_n(y)$ denotes the
chi-squared density with $n$ degrees of freedom. Improvements of this interval which
incorporate $\bar X$ into their construction are discussed in Cohen (1972) and Shorrock
(1990); also see Goutis and Casella (1991).]


Section 5.6 

Problem 5.33 (i) Under the assumptions made at the beginning of Section
5.6, the UMP unbiased test of $H : \rho = \rho_0$ is given by (5.44).

(ii) Let $(\underline{\rho}, \bar\rho)$ be the associated most accurate unbiased confidence intervals for
$\rho = a\gamma + b\delta$, where $\underline{\rho} = \underline{\rho}(a, b)$, $\bar\rho = \bar\rho(a, b)$. Then if $f_1$ and $f_2$ are increasing
functions, the expected value of $f_1(|\bar\rho - \rho|) + f_2(|\underline{\rho} - \rho|)$ is an increasing
function of $a^2/n + b^2$.

[(i): Make any orthogonal transformation from $y_1, \ldots, y_n$ to new variables
$z_1, \ldots, z_n$ such that $z_1 = \sum_i [bv_i + (a/n)]y_i/\sqrt{(a^2/n) + b^2}$, $z_2 = \sum_i (av_i -
b)y_i/\sqrt{a^2 + nb^2}$, and apply Problems 5.5 and 5.6.

(ii): If $a_1^2/n + b_1^2 < a_2^2/n + b_2^2$, the random variable $|\bar\rho(a_2, b_2) - \rho|$ is stochastically
larger than $|\bar\rho(a_1, b_1) - \rho|$, and analogously for $\underline{\rho}$.]


Section 5.7 

Problem 5.34 Verify the posterior distribution of $\Theta$ given $x$ in Example 5.7.1.

Problem 5.35 If $X_1, \ldots, X_n$ are independent $N(\theta, 1)$ and $\theta$ has the improper
prior $\pi(\theta) = 1$, determine the posterior distribution of $\theta$ given the $X$'s.


Problem 5.36 Verify the posterior distribution of $p$ given $x$ in Example 5.7.2.

Problem 5.37 In Example 5.7.3, verify the marginal posterior distribution of $\xi$
given $x$.

Problem 5.38 In Example 5.7.4, show that

(i) the posterior density $\pi(\sigma \mid x)$ is of type (c) of Example 5.7.2;

(ii) for sufficiently large $r$, the posterior density of $\sigma^r$ given $x$ is no longer of
type (c).

Problem 5.39 If $X$ is normal $N(\theta, 1)$ and $\theta$ has a Cauchy density $b/\{\pi[b^2 + (\theta -
\mu)^2]\}$, determine the possible shapes of the HPD regions for varying $\mu$ and $b$.

Problem 5.40 Let $\theta = (\theta_1, \ldots, \theta_s)$ with $\theta_i$ real-valued, $X$ have density $p_\theta(x)$,
and $\Theta$ a prior density $\pi(\theta)$. Then the $100\gamma\%$ HPD region is the $100\gamma\%$ credible
region $R$ that has minimum volume.

[Apply the Neyman-Pearson fundamental lemma to the problem of minimizing
the volume of $R$.]

Problem 5.41 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independently distributed as
$N(\xi, \sigma^2)$ and $N(\eta, \sigma^2)$ respectively, and let $(\xi, \eta, \sigma)$ have the joint improper prior
density given by

\[
\pi(\xi, \eta, \sigma)\,d\xi\,d\eta\,d\sigma = d\xi\,d\eta \cdot \frac{1}{\sigma}\,d\sigma \quad\text{for all } -\infty < \xi, \eta < \infty,\ 0 < \sigma.
\]

Under these assumptions, extend the results of Examples 5.7.3 and 5.7.4 to
inferences concerning (i) $\eta - \xi$ and (ii) $\sigma$.

Problem 5.42 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independently distributed as
$N(\xi, \sigma^2)$ and $N(\eta, \tau^2)$, respectively and let $(\xi, \eta, \sigma, \tau)$ have the joint improper
prior density $\pi(\xi, \eta, \sigma, \tau)\,d\xi\,d\eta\,d\sigma\,d\tau = d\xi\,d\eta\,(1/\sigma)\,d\sigma\,(1/\tau)\,d\tau$. Extend the result
of Example 5.7.4 to inferences concerning $\tau^2/\sigma^2$.

Note. The posterior distribution of $\eta - \xi$ in this case is the so-called
Behrens-Fisher distribution. The credible regions for $\eta - \xi$ obtained from this distribution
do not correspond to confidence intervals with fixed coverage probability, and the
associated tests of $H : \eta = \xi$ thus do not have fixed size (which instead depends
on $\tau/\sigma$). From numerical evidence [see Robinson (1976) for a summary of his and
earlier results] it appears that the confidence intervals are conservative, that is,
the actual coverage probability always exceeds the nominal one.

Problem 5.43 Let $T_1, \ldots, T_{s-1}$ have the multinomial distribution (2.34), and
suppose that $(p_1, \ldots, p_{s-1})$ has the Dirichlet prior $D(a_1, \ldots, a_s)$ with
density proportional to $p_1^{a_1-1} \cdots p_s^{a_s-1}$, where $p_s = 1 - (p_1 + \cdots + p_{s-1})$. Determine
the posterior distribution of $(p_1, \ldots, p_{s-1})$ given the $T$'s.


Section 5.8 

Problem 5.44 Prove Theorem 5.8.1 for arbitrary values of $c$.


Section 5.9

Problem 5.45 If $c = 1$, $m = n = 4$, $\alpha = .1$ and the ordered coordinates
$z_{(1)}, \ldots, z_{(N)}$ of a point $z$ are 1.97, 2.19, 2.61, 2.79, 2.88, 3.02, 3.28, 3.41, determine
the points of $S(z)$ belonging to the rejection region (5.53).


Problem 5.46 Confidence intervals for a shift. [Maritz (1979)]

(i) Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be independently distributed according to
continuous distributions $F(x)$ and $G(y) = F(y - \Delta)$ respectively. Without
any further assumptions concerning $F$, confidence intervals for $\Delta$ can
be obtained from permutation tests of the hypotheses $H(\Delta_0) : \Delta =
\Delta_0$. Specifically, consider the point $(z_1, \ldots, z_{m+n}) = (x_1, \ldots, x_m, y_1 -
\Delta, \ldots, y_n - \Delta)$ and the $\binom{m+n}{m}$ permutations $i_1 < \cdots < i_m$; $i_{m+1} < \cdots <
i_{m+n}$ of the integers $1, \ldots, m+n$. Suppose that the hypothesis $H(\Delta)$ is
accepted for the $k$ of these permutations which lead to the smallest values
of

\[
\left|\sum_{j=m+1}^{m+n} z_{i_j}/n - \sum_{j=1}^{m} z_{i_j}/m\right|
\]

where

\[
k = (1 - \alpha)\binom{m+n}{m}.
\]

Then the totality of values $\Delta$ for which $H(\Delta)$ is accepted constitute an
interval, and these intervals are confidence intervals for $\Delta$ at confidence
level $1 - \alpha$.

(ii) Let $Z_1, \ldots, Z_N$ be independently distributed, symmetric about $\theta$, with
distribution $F(z - \theta)$, where $F(z)$ is continuous and symmetric about 0.
Without any further assumptions about $F$, confidence intervals for $\theta$ can be
obtained by considering the $2^N$ points $Z_1', \ldots, Z_N'$ where $Z_i' = \pm(Z_i - \theta_0)$,
and accepting $H(\theta_0) : \theta = \theta_0$ for the $k$ of these points which lead to the
smallest values of $|\sum Z_i'|$, where $k = (1 - \alpha)2^N$.

[(i): A point is in the acceptance region for $H(\Delta)$ if

\[
\left|\frac{\sum (y_i - \Delta)}{n} - \frac{\sum x_i}{m}\right| = |\bar y - \bar x - \Delta|
\]

is exceeded by at least $\binom{m+n}{n} - k$ of the quantities $|\bar y' - \bar x' - \gamma\Delta|$, where
$(x_1', \ldots, x_m', y_1', \ldots, y_n')$ is a permutation of $(x_1, \ldots, x_m, y_1, \ldots, y_n)$, the quantity
$\gamma$ is determined by this permutation, and $|\gamma| \le 1$. The desired result now follows
from the following facts (for an alternative proof, see Section 14): (a) The set
of $\Delta$'s for which $(\bar y - \bar x - \Delta)^2 \le (\bar y' - \bar x' - \gamma\Delta)^2$ is, with probability one, an
interval containing $\bar y - \bar x$. (b) The set of $\Delta$'s for which $(\bar y - \bar x - \Delta)^2$ is exceeded
by a particular set of at least $\binom{m+n}{m} - k$ of the quantities $(\bar y' - \bar x' - \gamma\Delta)^2$ is the
intersection of the corresponding intervals (a) and hence is an interval containing
$\bar y - \bar x$. (c) The set of $\Delta$'s of interest is the union of the intervals (b) and, since
they have a nonempty intersection, also an interval.]


Section 5.10

Problem 5.47 In the matched-pairs experiment for testing the effect of a
treatment, suppose that only the differences $Z_i = Y_i - X_i$ are observable. The $Z$'s are
assumed to be a sample from an unknown continuous distribution, which under
the hypothesis of no treatment effect is symmetric with respect to the origin.
Under the alternatives it is symmetric with respect to a point $\zeta > 0$. Determine the
test which among all unbiased tests maximizes the power against the alternatives
that the $Z$'s are a sample from $N(\zeta, \sigma^2)$ with $\zeta > 0$.

[Under the hypothesis, the set of statistics $(\sum_{i=1}^n Z_i^2, \ldots, \sum_{i=1}^n Z_i^{2n})$ is
sufficient; that it is complete is shown as the corresponding result in Theorem 5.8.1.
The remainder of the argument follows the lines of Section 11.]

Problem 5.48 (i) If $X_1,\dots,X_n$; $Y_1,\dots,Y_n$ are independent normal variables
with common variance $\sigma^2$ and means $E(X_i) = \xi_i$, $E(Y_i) = \xi_i + \Delta$,
the UMP unbiased test of $\Delta = 0$ against $\Delta > 0$ is given by (5.58).

(ii) Determine the most accurate unbiased confidence intervals for $\Delta$.

[(i): The structure of the problem becomes clear if one makes the orthogonal
transformation $X'_i = (Y_i - X_i)/\sqrt{2}$, $Y'_i = (X_i + Y_i)/\sqrt{2}$.]

Problem 5.49 Comparison of two designs. Under the assumptions made at the
beginning of Section 12, one has the following comparison of the methods of
complete randomization and matched pairs. The unit effects and experimental
effects $U_i$ and $V_i$ are independently normally distributed with variances $\sigma_1^2$, $\sigma^2$
and means $E(U_i) = \mu$ and $E(V_i) = \xi$ or $\eta$ as $V_i$ corresponds to a control or
treatment. With complete randomization, the observations are $X_i = U_i + V_i$
$(i = 1,\dots,n)$ for the controls and $Y_i = U_{n+i} + V_{n+i}$ $(i = 1,\dots,n)$ for the treated
cases, with $E(X_i) = \mu + \xi$, $E(Y_i) = \mu + \eta$. For the matched pairs, if the matching is
assumed to be perfect, the $X$'s are as before, but $Y_i = U_i + V_{n+i}$. UMP unbiased
tests are given by (5.27) for complete randomization and by (5.58) for matched
pairs. The distribution of the test statistic under an alternative $\Delta = \eta - \xi$ is the
noncentral $t$-distribution with noncentrality parameter $\sqrt{n}\,\Delta/\sqrt{2(\sigma^2 + \sigma_1^2)}$ and
$2n - 2$ degrees of freedom in the first case, and with noncentrality parameter
$\sqrt{n}\,\Delta/(\sqrt{2}\,\sigma)$ and $n - 1$ degrees of freedom in the second. Thus the method of
matched pairs has the disadvantage of a smaller number of degrees of freedom
and the advantage of a larger noncentrality parameter. For $\alpha = .05$ and $\Delta = 4$,
compare the power of the two methods as a function of $n$ when $\sigma_1 = 1$, $\sigma = 2$ and
when $\sigma_1 = 2$, $\sigma = 1$.

Problem 5.50 Continuation. An alternative comparison of the two designs is
obtained by considering the expected length of the most accurate unbiased
confidence intervals for $\Delta = \eta - \xi$ in each case. Carry this out for varying $n$ and
confidence coefficient $1 - \alpha = .95$ when $\sigma_1 = 1$, $\sigma = 2$ and when $\sigma_1 = 2$, $\sigma = 1$.


Section 5.11 

Problem 5.51 Suppose that a critical function $\phi_0$ satisfies (5.64) but not (5.66),
and let $\alpha < \tfrac12$. Then the following construction provides a measurable critical
function $\phi$ satisfying (5.66) and such that $\phi_0(z) \le \phi(z)$ for all $z$. Inductively,
sequences of functions $\phi_1, \phi_2, \dots$ and $\psi_0, \psi_1, \dots$ are defined through the relations

\[ \psi_m(z) = \sum_{z' \in S(z)} \frac{\phi_m(z')}{N_1! \cdots N_c!}, \qquad m = 0, 1, \dots, \]

and

\[ \phi_m(z) = \begin{cases}
   \phi_{m-1}(z) + [\alpha - \psi_{m-1}(z)] & \text{if both } \phi_{m-1}(z) \text{ and } \psi_{m-1}(z) \text{ are } < \alpha, \\
   \phi_{m-1}(z) & \text{otherwise.}
   \end{cases} \]

The function $\phi(z) = \lim \phi_m(z)$ then satisfies the required conditions.

[The functions $\phi_m$ are nondecreasing and between 0 and 1. It is further seen by
induction that $0 \le \alpha - \psi_m(z) \le (1 - \gamma)^m [\alpha - \psi_0(z)]$, where $\gamma = 1/(N_1! \cdots N_c!)$.]


Problem 5.52 Consider the problem of testing $H : \eta = \xi$ in the family of
densities (5.61) when it is given that $\sigma > c > 0$ and that the point $(\zeta_{11},\dots,\zeta_{cN_c})$
of (5.62) lies in a bounded region $R$ containing a rectangle, where $c$ and $R$ are
known. Then Theorem 5.11.1 is no longer applicable. However, unbiasedness of
a test $\phi$ of $H$ implies (5.66), and therefore reduces the problem to the class of
permutation tests.

[Unbiasedness implies $\int \phi(z) p_{\sigma,\zeta}(z)\, dz = \alpha$ and hence

\[ \alpha = \int \psi(z) p_{\sigma,\zeta}(z)\, dz
   = \int \psi(z) \frac{1}{(\sqrt{2\pi}\,\sigma)^N}
     \exp\left[ -\frac{1}{2\sigma^2} \sum\sum (z_{ij} - \zeta_{ij})^2 \right] dz \]

for all $\sigma > c$ and $\zeta$ in $R$. The result follows from completeness of this last family.]


Problem 5.53 To generalize Theorem 5.11.1 to other designs, let $Z =
(Z_1,\dots,Z_N)$ and let $G = \{g_1,\dots,g_r\}$ be a group of permutations of $N$
coordinates or more generally a group of orthogonal transformations of $N$-space.
If

\[ p_{\sigma,\zeta}(z) = \frac{1}{r} \sum_{k=1}^{r} \frac{1}{(\sqrt{2\pi}\,\sigma)^N}
   \exp\left( -\frac{1}{2\sigma^2} |z - g_k \zeta|^2 \right), \]   (5.80)

where $|z|^2 = \sum z_i^2$, then $\int \phi(z) p_{\sigma,\zeta}(z)\, dz \le \alpha$ for all $\sigma > 0$ and all $\zeta$ implies

\[ \frac{1}{r} \sum_{z' \in S(z)} \phi(z') \le \alpha \quad \text{a.e.,} \]   (5.81)

where $S(z)$ is the set of points in $N$-space obtained from $z$ by applying to it all
the transformations $g_k$, $k = 1,\dots,r$.


Problem 5.54 Generalization of Corollary 5.11.1. Let $H$ be the class of densities
(5.80) with $\sigma > 0$ and $-\infty < \zeta_i < \infty$ $(i = 1,\dots,N)$. A complete family of
tests of $H$ at level of significance $\alpha$ is the class of permutation tests satisfying

\[ \frac{1}{r} \sum_{z' \in S(z)} \phi(z') = \alpha \quad \text{a.e.} \]   (5.82)
Section 5.12

Problem 5.55 If $c = 1$, $m = n = 3$, and if the ordered $x$'s and $y$'s are respectively
1.97, 2.19, 2.61 and 3.02, 3.28, 3.41, determine the points $\delta_{(1)},\dots,\delta_{(19)}$
defined as the ordered values of (5.72).


Problem 5.56 If $c = 4$, $m_i = n_i = 1$, and the pairs $(x_i, y_i)$ are (1.56,2.01),
(1.87,2.22), (2.17,2.73), and (2.31,2.60), determine the points $\delta_{(1)},\dots,\delta_{(15)}$ which
define the intervals (5.71).


Problem 5.57 If $m$, $n$ are positive integers with $m \le n$, then

\[ \sum_{K=1}^{m} \binom{m}{K} \binom{n}{K} = \binom{m+n}{m} - 1. \]
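
A quick numerical check can make the identity of Problem 5.57 concrete before attempting a proof. The sketch below assumes the identity reads: the sum over K = 1,...,m of C(m,K) C(n,K) equals C(m+n,m) - 1. The function names `lhs` and `rhs` are illustrative only, and a finite check is of course not a proof.

```python
from math import comb


def lhs(m: int, n: int) -> int:
    """Sum over K = 1..m of C(m, K) * C(n, K)."""
    return sum(comb(m, k) * comb(n, k) for k in range(1, m + 1))


def rhs(m: int, n: int) -> int:
    """C(m + n, m) - 1."""
    return comb(m + n, m) - 1


# Compare both sides on a small grid with 1 <= m <= n <= 12.
grid = [(m, n) for n in range(1, 13) for m in range(1, n + 1)]
assert all(lhs(m, n) == rhs(m, n) for m, n in grid)
```

For m = n = 3 both sides equal 19, which is the number of points requested in Problem 5.55.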


Problem 5.58 (i) Generalize the randomization models of Section 14 for
paired comparisons ($n_1 = \cdots = n_c = 2$) and the case of two groups ($c = 1$)
to an arbitrary number $c$ of groups of sizes $n_1,\dots,n_c$.

(ii) Generalize the confidence intervals (5.71) and (5.72) to the randomization
model of part (i).


Problem 5.59 Let $Z_1,\dots,Z_n$ be i.i.d. according to a continuous distribution
symmetric about $\theta$, and let $T_{(1)} < \cdots < T_{(M)}$ be the ordered set of $M = 2^n - 1$
subsample means $(Z_{i_1} + \cdots + Z_{i_r})/r$, $r \ge 1$. If $T_{(0)} = -\infty$, $T_{(M+1)} = \infty$, then

\[ P_\theta[T_{(i)} < \theta < T_{(i+1)}] = \frac{1}{M+1} \quad \text{for all } i = 0, 1, \dots, M. \]

[Hartigan (1969).] 


Problem 5.60 (i) Given $n$ pairs $(x_1, y_1),\dots,(x_n, y_n)$, let $G$ be the group of
$2^n$ permutations of the $2n$ variables which interchange $x_i$ and $y_i$ in all,
some, or none of the $n$ pairs. Let $G_0$ be any subgroup of $G$, and let $e$ be
the number of elements in $G_0$. Any element $g \in G_0$ (except the identity)
is characterized by the numbers $i_1,\dots,i_r$ $(r \ge 1)$ of the pairs in which $x_i$
and $y_i$ have been switched. Let $d_i = y_i - x_i$, and let $\delta_{(1)} < \cdots < \delta_{(e-1)}$
denote the ordered values $(d_{i_1} + \cdots + d_{i_r})/r$ corresponding to $G_0$. Then
(5.71) continues to hold with $e - 1$ in place of $M$.

(ii) State the generalization of Problem 5.59 to the situation of part (i).

[Hartigan (1969).] 
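
As an illustration of the quantities in Problem 5.60(i), the following sketch enumerates the values $(d_{i_1} + \cdots + d_{i_r})/r$ in the special case where $G_0$ is the full group $G$, so that every nonempty subset of $\{1,\dots,n\}$ occurs and $e - 1 = 2^n - 1$. The helper name `subset_means` and the data are hypothetical and are not taken from the text.

```python
from itertools import combinations


def subset_means(d):
    """Sorted means of d over all nonempty index subsets (the case G_0 = G)."""
    n = len(d)
    means = [
        sum(d[i] for i in idx) / r
        for r in range(1, n + 1)
        for idx in combinations(range(n), r)
    ]
    return sorted(means)


d = [0.3, -0.1, 0.8]           # hypothetical differences d_i = y_i - x_i
delta = subset_means(d)        # ordered values delta_(1) <= ... <= delta_(e-1)
assert len(delta) == 2 ** len(d) - 1
```

For a proper subgroup $G_0$ one would restrict the enumeration to the index subsets that correspond to elements of $G_0$.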


Problem 5.61 The preceding problem establishes a 1 : 1 correspondence between
$e - 1$ permutations $T$ of $G_0$ which are not the identity and $e - 1$ nonempty
subsets $\{i_1,\dots,i_r\}$ of the set $\{1,\dots,n\}$. If the permutations $T$ and $T'$ correspond
respectively to the subsets $R = \{i_1,\dots,i_r\}$ and $S = \{j_1,\dots,j_s\}$, then the group
product $T'T$ corresponds to the subset $(R \cap S^c) \cup (R^c \cap S) = (R \cup S) - (R \cap S)$,
where $R^c$ and $S^c$ denote complements in $\{1,\dots,n\}$.
[Hartigan (1969).] 
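
The correspondence in Problem 5.61 can be tried out on a small example. The sketch below represents a permutation that interchanges $x_i$ and $y_i$ by the set of switched indices, checks that composing two such permutations amounts to taking the symmetric difference of the index sets, and provides a brute-force closure test. The names `apply_swaps` and `is_closed` and the numerical pairs are hypothetical illustrations.

```python
from itertools import combinations


def apply_swaps(pairs, subset):
    """Interchange x_i and y_i for every (1-based) index i in `subset`."""
    return [(y, x) if i in subset else (x, y)
            for i, (x, y) in enumerate(pairs, start=1)]


def is_closed(subsets):
    """True if the family, together with the empty set, is closed under
    symmetric difference. Every element is its own inverse, so for a finite
    family closure is enough to give a group."""
    family = {frozenset(s) for s in subsets} | {frozenset()}
    return all((a ^ b) in family for a in family for b in family)


pairs = [(1.0, 1.5), (2.0, 2.2), (3.0, 3.9), (4.0, 4.1)]   # hypothetical data
R, S = {1, 2}, {2, 4}

# Applying T (switch on R) and then T' (switch on S) is one switch on R ^ S.
assert apply_swaps(apply_swaps(pairs, R), S) == apply_swaps(pairs, R ^ S)

# The family of all subsets of {1, 2, 3} is closed, as it must be for G itself.
all_subsets = [set(c) for r in range(4) for c in combinations({1, 2, 3}, r)]
assert is_closed(all_subsets)
```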
Problem 5.62 Determine for each of the following classes of subsets of
$\{1,\dots,n\}$ whether (together with the empty subset) it forms a group under
the group operation of the preceding problem: All subsets $\{i_1,\dots,i_r\}$ with

(i) r = 2 ; 

(ii) r = even; 

(iii) r divisible by 3. 

(iv) Give two other examples of subgroups Go of G. 

Note. A class of such subgroups is discussed by Forsythe and Hartigan 
(1970). 

Problem 5.63 Generalize Problems 5.60(i) and 5.61 to the case of two groups 
of sizes m and n (c = 1 ). 


Section 5.13 


Problem 5.64 (i) If the joint distribution of $X$ and $Y$ is the bivariate normal
distribution (5.69), then the conditional distribution of $Y$ given $x$ is the
normal distribution with variance $\tau^2(1 - \rho^2)$ and mean $\eta + (\rho\tau/\sigma)(x - \xi)$.

(ii) Let $(X_1, Y_1),\dots,(X_n, Y_n)$ be a sample from a bivariate normal distribution,
let $R$ be the sample correlation coefficient, and suppose that $\rho = 0$. Then
the conditional distribution of $\sqrt{n-2}\,R/\sqrt{1 - R^2}$ given $x_1,\dots,x_n$ is Student's
$t$-distribution with $n - 2$ degrees of freedom provided $\sum (x_i - \bar x)^2 > 0$.
This is therefore also the unconditional distribution of this statistic.


(iii) The probability density of $R$ itself is then

\[ p(r) = \frac{1}{\sqrt{\pi}} \frac{\Gamma[\frac12(n-1)]}{\Gamma[\frac12(n-2)]} (1 - r^2)^{\frac12 n - 2}. \]   (5.83)


[(ii): If $v_i = (x_i - \bar x)/\sqrt{\sum (x_j - \bar x)^2}$ so that $\sum v_i = 0$, $\sum v_i^2 = 1$, the statistic can
be written as

\[ \frac{\sum v_i Y_i}{\sqrt{\left[\sum Y_i^2 - n\bar Y^2 - (\sum v_i Y_i)^2\right]/(n-2)}}. \]

Since its distribution depends only on $\rho$ one can assume $\eta = 0$, $\tau = 1$. The desired
result follows from Problem 5.6 by making an orthogonal transformation from
$(Y_1,\dots,Y_n)$ to $(Z_1,\dots,Z_n)$ such that $Z_1 = \sqrt{n}\,\bar Y$, $Z_2 = \sum v_i Y_i$.]


Problem 5.65 (i) Let $(X_1, Y_1),\dots,(X_n, Y_n)$ be a sample from the bivariate
normal distribution (5.69), and let $S_1^2 = \sum (X_i - \bar X)^2$, $S_2^2 = \sum (Y_i - \bar Y)^2$,
$S_{12} = \sum (X_i - \bar X)(Y_i - \bar Y)$. There exists a UMP unbiased test for testing
the hypothesis $\tau/\sigma = \Delta$. Its acceptance region is

\[ \frac{|\Delta^2 S_1^2 - S_2^2|}{\sqrt{(\Delta^2 S_1^2 + S_2^2)^2 - 4\Delta^2 S_{12}^2}} \le C, \]

and the probability density of the test statistic is given by (5.83) when the
hypothesis is true.
(ii) Under the assumption $\tau = \sigma$, there exists a UMP unbiased test for testing
$\eta = \xi$, with acceptance region $|\bar Y - \bar X|/\sqrt{S_1^2 + S_2^2 - 2S_{12}} \le C$. On multiplication
by a suitable constant the test statistic has Student's $t$-distribution
with $n - 1$ degrees of freedom when $\eta = \xi$.

[Due to Morgan (1939) and Hsu (1940). (i): The transformation $U = \Delta X + Y$,
$V = X - (1/\Delta)Y$ reduces the problem to that of testing that the correlation
coefficient in a bivariate normal distribution is zero.

(ii): Transform to new variables $V_i = Y_i - X_i$, $U_i = Y_i + X_i$.]

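Problem 5.66 (i) Let $(X_1, Y_1),\dots,(X_n, Y_n)$ be a sample from the bivariate
normal distribution (5.73), and let $S_1^2 = \sum (X_i - \bar X)^2$, $S_{12} = \sum (X_i - \bar X)(Y_i - \bar Y)$,
$S_2^2 = \sum (Y_i - \bar Y)^2$.

Then $(S_1^2, S_{12}, S_2^2)$ are independently distributed of $(\bar X, \bar Y)$, and their joint
distribution is the same as that of $(\sum_{i=1}^{n-1} X_i'^2, \sum_{i=1}^{n-1} X'_i Y'_i, \sum_{i=1}^{n-1} Y_i'^2)$,
where $(X'_i, Y'_i)$, $i = 1,\dots,n - 1$, are a sample from the distribution (5.73)
with $\xi = \eta = 0$.

(ii) Let $X_1,\dots,X_m$ and $Y_1,\dots,Y_m$ be two samples from $N(0,1)$. Then the
joint density of $S_1^2 = \sum X_i^2$, $S_{12} = \sum X_i Y_i$, $S_2^2 = \sum Y_i^2$ is

\[ \frac{1}{4\pi\,\Gamma(m-1)} (s_1^2 s_2^2 - s_{12}^2)^{\frac12(m-3)} \exp\left[ -\tfrac12 (s_1^2 + s_2^2) \right] \]

for $s_{12}^2 \le s_1^2 s_2^2$, and zero elsewhere.

(iii) The joint density of the statistics $(S_1^2, S_{12}, S_2^2)$ of part (i) is

\[ \frac{(s_1^2 s_2^2 - s_{12}^2)^{\frac12(n-4)}}{4\pi\,\Gamma(n-2)\left(\sigma\tau\sqrt{1-\rho^2}\right)^{n-1}}
   \exp\left[ -\frac{1}{2(1-\rho^2)} \left( \frac{s_1^2}{\sigma^2} - \frac{2\rho s_{12}}{\sigma\tau} + \frac{s_2^2}{\tau^2} \right) \right] \]   (5.84)

for $s_{12}^2 \le s_1^2 s_2^2$ and zero elsewhere.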

[(i): Make an orthogonal transformation from $X_1,\dots,X_n$ to $X'_1,\dots,X'_n$ such that
$X'_n = \sqrt{n}\,\bar X$, and apply the same orthogonal transformation also to $Y_1,\dots,Y_n$.
Then

\[ Y'_n = \sqrt{n}\,\bar Y, \qquad \sum_{i=1}^{n-1} X'_i Y'_i = \sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y), \]

\[ \sum_{i=1}^{n-1} X_i'^2 = \sum_{i=1}^{n} (X_i - \bar X)^2, \qquad \sum_{i=1}^{n-1} Y_i'^2 = \sum_{i=1}^{n} (Y_i - \bar Y)^2. \]

The pairs of variables $(X'_1, Y'_1),\dots,(X'_n, Y'_n)$ are independent, each with a
bivariate normal distribution with the same variances and correlation as those of
$(X, Y)$ and with means $E(X'_i) = E(Y'_i) = 0$ for $i = 1,\dots,n - 1$.

(ii): Consider first the joint distribution of $S_{12} = \sum x_i Y_i$ and $S_2^2 = \sum Y_i^2$ given
$x_1,\dots,x_m$. Letting $Z_1 = S_{12}/\sqrt{\sum x_i^2}$ and making an orthogonal transformation
from $Y_1,\dots,Y_m$ to $Z_1,\dots,Z_m$ so that $S_2^2 = \sum_{i=1}^m Z_i^2$, the variables $Z_1$ and
$\sum_{i=2}^m Z_i^2 = S_2^2 - Z_1^2$ are independently distributed as $N(0,1)$ and $\chi^2_{m-1}$ respectively.
From this the joint conditional density of $S_{12} = s_1 Z_1$ and $S_2^2$ is obtained by
a simple transformation of variables. Since the conditional distribution depends
on the $x$'s only through $s_1^2$, the joint density of $S_1^2$, $S_{12}$, $S_2^2$ is found by multiplying
the above conditional density by the marginal one of $S_1^2$, which is $\chi^2_m$. The proof
is completed through use of the identity

\[ \Gamma\left[\tfrac12(m-1)\right] \Gamma\left(\tfrac12 m\right) = \frac{\sqrt{\pi}\,\Gamma(m-1)}{2^{m-2}}. \]


(iii): If $(X', Y') = (X'_1, Y'_1; \dots; X'_m, Y'_m)$ is a sample from a bivariate normal
distribution with $\xi = \eta = 0$, then $T = (\sum X_i'^2, \sum X'_i Y'_i, \sum Y_i'^2)$ is sufficient
for $\theta = (\sigma, \rho, \tau)$, and the density of $T$ is obtained from that given in part (ii) for
$\theta_0 = (1, 0, 1)$ through the identity [Problem 3.39 (i)]

\[ p_\theta^{T}(t) = p_{\theta_0}^{T}(t)\, \frac{p_\theta^{X',Y'}(x', y')}{p_{\theta_0}^{X',Y'}(x', y')}. \]

The result now follows from part (i) with $m = n - 1$.]


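Problem 5.67 If $(X_1, Y_1),\dots,(X_n, Y_n)$ is a sample from a bivariate normal
distribution, the probability density of the sample correlation coefficient $R$ is$^{15}$

\[ p_\rho(r) = \frac{2^{n-3}}{\pi (n-3)!} (1 - \rho^2)^{\frac12(n-1)} (1 - r^2)^{\frac12(n-4)}
   \sum_{k=0}^{\infty} \Gamma^2\left[\tfrac12(n + k - 1)\right] \frac{(2\rho r)^k}{k!} \]   (5.85)

or alternatively

\[ p_\rho(r) = \frac{n-2}{\pi} (1 - \rho^2)^{\frac12(n-1)} (1 - r^2)^{\frac12(n-4)}
   \int_0^1 \frac{t^{n-2}}{(1 - \rho r t)^{n-1}} \frac{1}{\sqrt{1 - t^2}}\, dt. \]   (5.86)

Another form is obtained by making the transformation $t = (1 - v)/(1 - \rho r v)$ in
the integral on the right-hand side of (5.86). The integral then becomes

\[ \frac{1}{(1 - \rho r)^{\frac12(2n-3)}} \int_0^1 \frac{(1 - v)^{n-2}}{\sqrt{2v}}
   \left[ 1 - \tfrac12 v (1 + \rho r) \right]^{-1/2} dv. \]   (5.87)

Expanding the last factor in powers of $v$, the density becomes

\[ \frac{n-2}{\sqrt{2\pi}} \frac{\Gamma(n-1)}{\Gamma(n - \frac12)} (1 - \rho^2)^{\frac12(n-1)} (1 - r^2)^{\frac12(n-4)} (1 - \rho r)^{-n + \frac32}
   \, F\left( \tfrac12; \tfrac12; n - \tfrac12; \frac{1 + \rho r}{2} \right), \]   (5.88)

where

\[ F(a, b, c, x) = \sum_{j=0}^{\infty} \frac{\Gamma(a+j)}{\Gamma(a)} \frac{\Gamma(b+j)}{\Gamma(b)} \frac{\Gamma(c)}{\Gamma(c+j)} \frac{x^j}{j!} \]   (5.89)

is a hypergeometric function.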


15 The distribution of R is reviewed by Johnson and Kotz (1970, Vol. 2, Section 32) 
and Patel and Read (1982). 
[To obtain the first expression make a transformation from $(S_1^2, S_2^2, S_{12})$ with
density (5.84) to $(S_1^2, S_2^2, R)$ and expand the factor $\exp\{\rho s_{12}/(1 - \rho^2)\sigma\tau\} =
\exp\{\rho r s_1 s_2/(1 - \rho^2)\sigma\tau\}$ into a power series. The resulting series can be
integrated term by term with respect to $s_1^2$ and $s_2^2$. The equivalence with the second
expression is seen by expanding the factor $(1 - \rho r t)^{-(n-1)}$ under the integral in
(5.86) and integrating term by term.]

Problem 5.68 If $X$ and $Y$ have a bivariate normal distribution with correlation
coefficient $\rho > 0$, they are positively regression-dependent.

[The conditional distribution of $Y$ given $x$ is normal with mean $\eta + \rho\tau\sigma^{-1}(x - \xi)$
and variance $\tau^2(1 - \rho^2)$. Through addition to such a variable of the positive
quantity $\rho\tau\sigma^{-1}(x' - x)$ it is transformed into one with the conditional distribution
of $Y$ given $x' > x$.]

Problem 5.69 (i) The functions (5.78) are bivariate cumulative distribution
functions.

(ii) A pair of random variables with distribution (5.78) is positively regression-
dependent. [The distributions (5.78) were introduced by Morgenstern
(1956).]

Problem 5.70 If X, Y are positively regression dependent, they are positively 
quadrant dependent. 

[Positive regression dependence implies that

\[ P[Y \le y \mid X \le x] \ge P[Y \le y \mid X \le x'] \quad \text{for all } x < x' \text{ and } y, \]   (5.90)

and (5.90) implies positive quadrant dependence.]


5.15 Notes 

The optimal properties of the one- and two-sample normal-theory tests were
obtained by Neyman and Pearson (1933) as some of the principal applications of
their general theory. Theorem 5.1.2 is due to Basu (1955), and its uses are
reviewed in Boos and Hughes-Oliver (1998). For converse aspects of this theorem see
Basu (1958), Koehn and Thomas (1975), Bahadur (1979), Lehmann (1980) and
Basu (1982). An interesting application is discussed in Boos and Hughes-Oliver
(1998). In some exponential family regression models where UMPU tests do not
exist, classes of admissible, unbiased tests are obtained in Cohen, Kemperman
and Sackrowitz (1994).

The roots of the randomization model of Section 5.10 can be traced to Neyman 
(1923); see Speed (1990) and Fienberg and Tanur (1996). Permutation tests, as 
alternatives to the standard tests having fixed critical levels, were initiated by 
Fisher (1935a) and further developed, among others, by Pitman (1937, 1938a), 
Lehmann and Stein (1949), Hoeffding (1952), and Box and Andersen (1955). 
Some aspects of these tests are reviewed in Bell and Sen (1984) and Good (1994). 
Applications to various experimental designs are given in Welch (1990). Optimality
of permutation tests in a multivariate nonparametric two-sample setting is
studied in Runger and Eaton (1992). Explicit confidence intervals based on
subsampling were given by Hartigan (1969). The theory of unbiased confidence sets
and its relation to that of unbiased tests is due to Neyman (1937a).



6 

Invariance 


6.1 Symmetry and Invariance 

Many statistical problems exhibit symmetries, which provide natural restrictions
to impose on the statistical procedures that are to be employed. Suppose, for
example, that $X_1,\dots,X_n$ are independently distributed with probability densities
$p_{\theta_1}(x_1),\dots,p_{\theta_n}(x_n)$. For testing the hypothesis $H : \theta_1 = \cdots = \theta_n$ against
the alternative that the $\theta$'s are not all equal, the test should be symmetric in
$x_1,\dots,x_n$, since otherwise the acceptance or rejection of the hypothesis would
depend on the (presumably quite irrelevant) numbering of these variables.

As another example consider a circular target with center O, on which are 
marked the impacts of a number of shots. Suppose that the points of impact 
are independent observations on a bivariate normal distribution centered on O. 
In testing this distribution for circular symmetry with respect to O, it seems 
reasonable to require that the test itself exhibit such symmetry. For if it lacks 
this feature, a two-dimensional (for example, Cartesian) coordinate system is 
required to describe the test, and acceptance or rejection will depend on the 
choice of this system, which under the assumptions made is quite arbitrary and 
has no bearing on the problem. 

The mathematical expression of symmetry is invariance under a suitable group 
of transformations. In the first of the two examples above the group is that of 
all permutations of the variables $x_1,\dots,x_n$, since a function of $n$ variables is
symmetric if and only if it remains invariant under all permutations of these 
variables. In the second example, circular symmetry with respect to the center 
O is equivalent to invariance under all rotations about O. 

In general, let $X$ be distributed according to a probability distribution $P_\theta$, $\theta \in
\Omega$, and let $g$ be a transformation of the sample space $\mathcal{X}$. All such transformations
considered in connection with invariance will be assumed to be 1 : 1 transformations
of $\mathcal{X}$ onto itself. Denote by $gX$ the random variable that takes on the
value $gx$ when $X = x$, and suppose that when the distribution of $X$ is $P_\theta$, $\theta \in \Omega$,
the distribution of $gX$ is $P_{\theta'}$ with $\theta'$ also in $\Omega$. The element $\theta'$ of $\Omega$ which is
associated with $\theta$ in this manner will be denoted by $\bar g\theta$, so that

\[ P_\theta\{gX \in A\} = P_{\bar g\theta}\{X \in A\}. \]   (6.1)

Here the subscript $\theta$ on the left member indicates the distribution of $X$, not that
of $gX$. Equation (6.1) can also be written as $P_\theta(g^{-1}A) = P_{\bar g\theta}(A)$ and hence as

\[ P_{\bar g\theta}(gA) = P_\theta(A). \]   (6.2)

The parameter set $\Omega$ remains invariant under $g$ (or is preserved by $g$) if $\bar g\theta \in \Omega$
for all $\theta \in \Omega$, and if in addition for any $\theta' \in \Omega$ there exists $\theta \in \Omega$ such that
$\bar g\theta = \theta'$. These two conditions can be expressed by the equation

\[ \bar g\Omega = \Omega. \]   (6.3)

The transformation $\bar g$ of $\Omega$ onto itself defined in this way is 1 : 1 provided the
distributions $P_\theta$ corresponding to different values of $\theta$ are distinct. To see this let
$\bar g\theta_1 = \bar g\theta_2$. Then $P_{\bar g\theta_1}(gA) = P_{\bar g\theta_2}(gA)$ and therefore $P_{\theta_1}(A) = P_{\theta_2}(A)$ for all $A$,
so that $\theta_1 = \theta_2$.

Lemma 6.1.1 Let $g$, $g'$ be two transformations preserving $\Omega$. Then the
transformations $g'g$ and $g^{-1}$ defined by

\[ (g'g)x = g'(gx) \quad\text{and}\quad g(g^{-1}x) = x \quad\text{for all } x \in \mathcal{X} \]

also preserve $\Omega$ and satisfy

\[ \overline{g'g} = \bar g' \cdot \bar g \quad\text{and}\quad \overline{(g^{-1})} = (\bar g)^{-1}. \]   (6.4)

Proof. If the distribution of $X$ is $P_\theta$ then that of $gX$ is $P_{\bar g\theta}$ and that of $g'gX =
g'(gX)$ is therefore $P_{\bar g'\bar g\theta}$. This establishes the first equation of (6.4); the proof of
the second one is analogous. ■

We shall say that the problem of testing $H : \theta \in \Omega_H$ against $K : \theta \in \Omega_K$
remains invariant under a transformation $g$ if $\bar g$ preserves both $\Omega_H$ and $\Omega_K$, so
that the equation

\[ \bar g\Omega_H = \Omega_H \]   (6.5)

holds in addition to (6.3). Let $C$ be a class of transformations satisfying these
two conditions, and let $G$ be the smallest class of transformations containing $C$
such that $g, g' \in G$ implies that $g'g$ and $g^{-1}$ belong to $G$. Then $G$ is a group of
transformations, all of which by Lemma 6.1.1 preserve both $\Omega$ and $\Omega_H$. Any class
$C$ of transformations leaving the problem invariant can therefore be extended
to a group $G$. It follows further from Lemma 6.1.1 that the class of induced
transformations $\bar g$ form a group $\bar G$. The two equations (6.4) express the fact that
$\bar G$ is a homomorphism of $G$.
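
The passage from a class $C$ to the smallest class containing it that is closed under products and inverses can be sketched for finite permutation groups. The code below is only an illustration under stated assumptions: transformations are permutations of the coordinate indices $0,\dots,n-1$ stored as tuples, and the names `compose`, `inverse` and `generate_group` are hypothetical.

```python
def compose(g2, g1):
    """Product g2 g1, i.e. apply g1 first and then g2."""
    return tuple(g2[i] for i in g1)


def inverse(g):
    """Inverse permutation of g."""
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi] = i
    return tuple(inv)


def generate_group(generators):
    """Smallest set containing `generators` closed under products and inverses."""
    group = set(generators)
    frontier = list(group)
    while frontier:
        g = frontier.pop()
        candidates = [inverse(g)]
        for h in list(group):
            candidates += [compose(g, h), compose(h, g)]
        for c in candidates:
            if c not in group:
                group.add(c)
                frontier.append(c)
    return group


# A transposition and a 3-cycle on three coordinates generate all 3! permutations.
C = [(1, 0, 2), (1, 2, 0)]
G = generate_group(C)
assert len(G) == 6
```

A single transposition, by contrast, generates a group with only two elements, the transposition and the identity.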

In the presence of symmetries in both sample and parameter space represented
by the groups $G$ and $\bar G$, it is natural to restrict attention to tests $\phi$ which are
also symmetric, that is, which satisfy

\[ \phi(gx) = \phi(x) \quad\text{for all } x \in \mathcal{X} \text{ and } g \in G. \]   (6.6)
A test $\phi$ satisfying (6.6) is said to be invariant under $G$. The restriction to
invariant tests is a particular case of the principle of invariance formulated in
Section 1.5. As was indicated there and in the examples above, a transformation
$g$ can be interpreted as a change of coordinates. From this point of view, a test
is invariant if it is independent of the particular coordinate system in which the
data are expressed.$^1$

A transformation $g$, in order to leave a problem invariant, must in particular
preserve the class $\mathcal{A}$ of measurable sets over which the distributions $P_\theta$ are
defined. This means that any set $A \in \mathcal{A}$ is transformed into a set of $\mathcal{A}$ and is
the image of such a set, so that $gA$ and $g^{-1}A$ both belong to $\mathcal{A}$. Any transformation
satisfying this condition is said to be bimeasurable. Since a group with
each element $g$ also contains $g^{-1}$, its elements are automatically bimeasurable if
all of them are measurable. If $g'$ and $g$ are bimeasurable, so are $g'g$ and $g^{-1}$. The
transformations of the group $G$ above generated by a class $C$ are therefore all
bimeasurable provided this is the case for the transformations of $C$.


6.2 Maximal Invariants 

If a problem is invariant under a group of transformations, the principle of
invariance restricts attention to invariant tests. In order to obtain the best of these,
it is convenient first to characterize the totality of invariant tests.

Let two points $x_1$, $x_2$ be considered equivalent under $G$,

\[ x_1 \sim x_2 \pmod{G}, \]

if there exists a transformation $g \in G$ for which $x_2 = gx_1$. This is a true equivalence
relation, since $G$ is a group and the sets of equivalent points, the orbits of $G$,
therefore constitute a partition of the sample space. (Cf. Appendix, Section A.1.)
A point $x$ traces out an orbit as all transformations $g$ of $G$ are applied to it; this
means that the orbit containing $x$ consists of the totality of points $gx$ with $g \in G$.
It follows from the definition of invariance that a function is invariant if and only
if it is constant on each orbit.

A function $M$ is said to be maximal invariant if it is invariant and if

\[ M(x_1) = M(x_2) \quad\text{implies}\quad x_2 = gx_1 \text{ for some } g \in G, \]   (6.7)

that is, if it is constant on the orbits but for each orbit takes on a different value.
All maximal invariants are equivalent in the sense that their sets of constancy
coincide.

Theorem 6.2.1 Let $M(x)$ be a maximal invariant with respect to $G$. Then, a
necessary and sufficient condition for $\phi$ to be invariant is that it depends on $x$ only
through $M(x)$; that is, that there exists a function $h$ for which $\phi(x) = h[M(x)]$
for all $x$.


$^1$ The relationship between this concept of invariance under reparametrization and
that considered in differential geometry is discussed in Barndorff-Nielsen, Cox and Reid
(1986).
Proof. If $\phi(x) = h[M(x)]$ for all $x$, then $\phi(gx) = h[M(gx)] = h[M(x)] = \phi(x)$
so that $\phi$ is invariant. On the other hand, if $\phi$ is invariant and if $M(x_1) = M(x_2)$,
then $x_2 = gx_1$ for some $g$ and therefore $\phi(x_2) = \phi(x_1)$. ■

Example 6.2.1 (i) Let $x = (x_1,\dots,x_n)$, and let $G$ be the group of translations

\[ gx = (x_1 + c,\dots,x_n + c), \qquad -\infty < c < \infty. \]

Then the set of differences $y = (x_1 - x_n,\dots,x_{n-1} - x_n)$ is invariant under $G$. To
see that it is maximal invariant suppose that $x_i - x_n = x'_i - x'_n$ for $i = 1,\dots,n - 1$.
Putting $x'_n - x_n = c$, one has $x'_i = x_i + c$ for all $i$, as was to be shown. The function
$y$ is of course only one representation of the maximal invariant. Others are for
example $(x_1 - x_2, x_2 - x_3,\dots,x_{n-1} - x_n)$ or the redundant $(x_1 - \bar x,\dots,x_n - \bar x)$. In
the particular case that $n = 1$, there are no invariants. The whole space is a single
orbit, so that for any two points there exists a transformation of $G$ taking one
into the other. In such a case the transformation group $G$ is said to be transitive.
The only invariant functions are then the constant functions $\phi(x) = c$.

(ii) If $G$ is the group of transformations

\[ gx = (cx_1,\dots,cx_n), \qquad c \ne 0, \]

a special role is played by any zero coordinates. However, in statistical applications
the set of points for which none of the coordinates is zero typically has
probability 1; attention can then be restricted to this part of the sample space,
and the set of ratios $x_1/x_n,\dots,x_{n-1}/x_n$ is a maximal invariant. Without this
restriction, two points $x$, $x'$ are equivalent with respect to the maximal invariant
partition if among their coordinates there are the same number of zeros (if any),
if these occur at the same places, and if for any two nonzero coordinates $x_i$, $x_j$
the ratios $x_j/x_i$ and $x'_j/x'_i$ are equal.

(iii) Let $x = (x_1,\dots,x_n)$, and let $G$ be the group of all orthogonal transformations
$x' = \Gamma x$ of $n$-space. Then $\sum x_i^2$ is maximal invariant, that is, two points
if and only if they have the same distance from the origin. The proof of this is 
immediate if one restricts attention to the plane containing the points x, x* and 
the origin. ■ 
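
The three parts of Example 6.2.1 lend themselves to a small numerical check. The sketch below is an illustration only: the function names are hypothetical, the orthogonal transformation is restricted to a rotation in one coordinate plane, and floating-point comparisons use a tolerance.

```python
import math


def translate(x, c):
    return [xi + c for xi in x]


def scale(x, c):
    return [c * xi for xi in x]


def rotate_plane(x, angle, i=0, j=1):
    """An orthogonal transformation: rotate coordinates i and j by `angle`."""
    y = list(x)
    y[i] = math.cos(angle) * x[i] - math.sin(angle) * x[j]
    y[j] = math.sin(angle) * x[i] + math.cos(angle) * x[j]
    return y


def differences(x):          # part (i): (x_1 - x_n, ..., x_{n-1} - x_n)
    return [xi - x[-1] for xi in x[:-1]]


def ratios(x):               # part (ii): assumes no coordinate is zero
    return [xi / x[-1] for xi in x[:-1]]


def sum_of_squares(x):       # part (iii)
    return sum(xi * xi for xi in x)


def close(u, v, tol=1e-9):
    return len(u) == len(v) and all(
        math.isclose(a, b, rel_tol=tol, abs_tol=tol) for a, b in zip(u, v))


x = [1.5, -2.0, 4.0, 0.5]
assert close(differences(translate(x, 3.7)), differences(x))
assert close(ratios(scale(x, -2.5)), ratios(x))
assert math.isclose(sum_of_squares(rotate_plane(x, 0.8)), sum_of_squares(x))

# Maximality in (i): a point with the same differences is a translate of x,
# with c = x'_n - x_n as in the text.
x_prime = [4.0, 0.5, 6.5, 3.0]
assert close(differences(x_prime), differences(x))
assert close(translate(x, x_prime[-1] - x[-1]), x_prime)
```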

Example 6.2.2 (i) Let $x = (x_1,\dots,x_n)$, and let $G$ be the set of $n!$ permutations
of the coordinates of $x$. Then the set of ordered coordinates (order statistics)
$x_{(1)} \le \cdots \le x_{(n)}$ is maximal invariant. A permutation of the $x_i$ obviously does
not change the set of values of the coordinates and therefore not the $x_{(i)}$. On the
other hand, two points with the same set of ordered coordinates can be obtained
from each other through a permutation of coordinates.

(ii) Let $G$ be the totality of transformations $x'_i = f(x_i)$, $i = 1,\dots,n$, such that $f$
is continuous and strictly increasing, and suppose that attention can be restricted
to the points that have $n$ distinct coordinates. If the $x_i$ are considered as $n$ points
on the real line, any such transformation preserves their order. Conversely, if
$x_1,\dots,x_n$ and $x'_1,\dots,x'_n$ are two sets of points in the same order, say $x_{i_1} < \cdots <
x_{i_n}$ and $x'_{i_1} < \cdots < x'_{i_n}$, there exists a transformation $f$ satisfying the required
conditions and such that $x'_i = f(x_i)$ for all $i$. It can be defined for example as
$f(x) = x + (x'_{i_1} - x_{i_1})$ for $x \le x_{i_1}$, $f(x) = x + (x'_{i_n} - x_{i_n})$ for $x \ge x_{i_n}$, and
to be linear between $x_{i_k}$ and $x_{i_{k+1}}$ for $k = 1,\dots,n - 1$. A formal expression for
the maximal invariant in this case is the set of ranks $(r_1,\dots,r_n)$ of $(x_1,\dots,x_n)$.
Here the rank $r_i$ of $x_i$ is defined through

\[ x_i = x_{(r_i)} \]

so that $r_i$ is the number of $x$'s $\le x_i$. In particular, $r_i = 1$ if $x_i$ is the smallest
$x$, $r_i = 2$ if it is the second smallest, and so on. ■
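
Both parts of Example 6.2.2 can be illustrated with a few lines of code. The sketch below, with hypothetical names `ranks` and `monotone_map`, computes the ranks as defined above and implements the piecewise-linear increasing map $f$ described in part (ii), which is a shift outside the range of the points and linear between consecutive points.

```python
import math


def ranks(x):
    """r_i = number of coordinates <= x_i (coordinates assumed distinct)."""
    return [sum(1 for xj in x if xj <= xi) for xi in x]


def monotone_map(xs, xps):
    """Strictly increasing f with f(xs[k]) = xps[k]; both lists sorted increasing."""
    def f(t):
        if t <= xs[0]:
            return t + (xps[0] - xs[0])
        if t >= xs[-1]:
            return t + (xps[-1] - xs[-1])
        for k in range(len(xs) - 1):
            if xs[k] <= t <= xs[k + 1]:
                w = (t - xs[k]) / (xs[k + 1] - xs[k])
                return xps[k] + w * (xps[k + 1] - xps[k])
    return f


x = [2.3, -1.0, 0.4, 7.1]
x_prime = [10.0, 0.0, 1.0, 11.0]          # same order pattern as x

# (i) order statistics are unchanged by a permutation of the coordinates
assert sorted([x[2], x[0], x[3], x[1]]) == sorted(x)

# (ii) ranks are unchanged by a continuous strictly increasing map
g = lambda t: math.atan(t) + t ** 3
assert ranks([g(t) for t in x]) == ranks(x)

# Equal ranks: the map from the text carries x into x_prime coordinatewise
assert ranks(x) == ranks(x_prime)
f = monotone_map(sorted(x), sorted(x_prime))
assert all(math.isclose(f(a), b) for a, b in zip(x, x_prime))
```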

Example 6.2.3 Let $x$ be an $n \times s$ matrix $(s \le n)$ of rank $s$, and let $G$ be the
group of linear transformations $gx = xB$, where $B$ is any nonsingular $s \times s$ matrix.
Then a maximal invariant under $G$ is the matrix $t(x) = x(x^T x)^{-1} x^T$, where $x^T$
denotes the transpose of $x$. Here $(x^T x)^{-1}$ is meaningful because the $s \times s$ matrix
$x^T x$ is nonsingular; see Problem 6.3. That $t(x)$ is invariant is clear, since

\[ t(gx) = xB(B^T x^T x B)^{-1} B^T x^T = x(x^T x)^{-1} x^T = t(x). \]

To see that $t(x)$ is maximal invariant, suppose that

\[ x_1 (x_1^T x_1)^{-1} x_1^T = x_2 (x_2^T x_2)^{-1} x_2^T. \]

Since $(x_i^T x_i)^{-1}$ is positive definite, there exist nonsingular matrices $C_i$ such that
$(x_i^T x_i)^{-1} = C_i C_i^T$ and hence

\[ (x_1 C_1)(x_1 C_1)^T = (x_2 C_2)(x_2 C_2)^T. \]

This implies the existence of an orthogonal matrix $Q$ such that $x_2 C_2 = x_1 C_1 Q$
and thus $x_2 = x_1 B$ with $B = C_1 Q C_2^{-1}$, as was to be shown.

In the special case $s = n$, we have $t(x) = I$, so that there are no nontrivial
invariants. This corresponds to the fact that in this case $G$ is transitive, since any
two nonsingular $n \times n$ matrices $x_1$ and $x_2$ satisfy $x_2 = x_1 B$ with $B = x_1^{-1} x_2$. This
result can be made more intuitive through a geometric interpretation. Consider
the $s$-dimensional subspace $S$ of $R^n$ spanned by the $s$ columns of $x$. Then $P =
x(x^T x)^{-1} x^T$ has the property that for any $y$ in $R^n$, the vector $Py$ is the projection
of $y$ onto $S$. (This will be proved in Section 7.2.) The invariance of $P$ expresses
the fact that the projection of $y$ onto $S$ is independent of the choice of vectors
spanning $S$. To see that it is maximal invariant, suppose that the projection of
every $y$ onto the spaces $S_1$ and $S_2$ spanned by two different sets of $s$ vectors is
the same. Then $S_1 = S_2$, so that the two sets of vectors span the same space.
There then exists a nonsingular transformation taking one of these sets into the 
other. ■ 
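
The invariance of $t(x) = x(x^T x)^{-1} x^T$ under $x \to xB$ can be checked numerically in a small case. The sketch below assumes $s = 2$ so that a closed-form $2 \times 2$ inverse suffices; the helper names `matmul`, `transpose`, `inv2` and `projection` and the matrices are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inv2(a):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def projection(x):
    """t(x) = x (x^T x)^{-1} x^T for an n x 2 matrix x of rank 2."""
    xt = transpose(x)
    return matmul(matmul(x, inv2(matmul(xt, x))), xt)


def same(a, b, tol=1e-9):
    return all(abs(u - v) < tol for ra, rb in zip(a, b) for u, v in zip(ra, rb))


x = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]     # 3 x 2, rank 2
B = [[2.0, 1.0], [-1.0, 3.0]]                # nonsingular, determinant 7
P = projection(x)

assert same(projection(matmul(x, B)), P)     # t(xB) = t(x)
assert same(matmul(P, P), P)                 # P is idempotent, as a projection
assert same(matmul(P, x), x)                 # columns of x lie in S
```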

A somewhat more systematic way of determining maximal invariants is obtained
by selecting, by means of a specified rule, a unique point $M(x)$ on
each orbit. Then clearly $M(X)$ is maximal invariant. To illustrate this method,
consider once more two of the earlier examples.

Example 6.2.1(i) (continued). The orbit containing the point $(a_1,\dots,a_n)$ under
the group of translations is the set $\{(a_1 + c,\dots,a_n + c), -\infty < c < \infty\}$, which
is a line in $E_n$.

(a) As representative point $M(x)$ on this line, take its intersection with the
hyperplane $x_n = 0$. Since then $a_n + c = 0$, this point corresponds to the
value $c = -a_n$ and thus has coordinates $(a_1 - a_n,\dots,a_{n-1} - a_n, 0)$. This
leads to the maximal invariant $(x_1 - x_n,\dots,x_{n-1} - x_n)$.
(b) An alternative point on the line is its intersection with the hyperplane
$\sum x_i = 0$. Then $c = -\bar a$, and $M(a) = (a_1 - \bar a,\dots,a_n - \bar a)$.

(c) The point need not be specified by an intersection property. It can for instance
be taken as the point on the line that is closest to the origin. Since
the value of $c$ minimizing $\sum (a_i + c)^2$ is $c = -\bar a$, this leads to the same point
as (b). ■

Example 6.2.1(iii) (continued). The orbit containing the point $(a_1,\dots,a_n)$
under the group of orthogonal transformations is the hypersphere containing
$(a_1,\dots,a_n)$ and with center at the origin. As representative point on this sphere,
take its north pole, i.e. the point with $a_1 = \cdots = a_{n-1} = 0$. The coordinates of
this point are $(0,\dots,0,\sqrt{\sum a_i^2})$ and hence lead to the maximal invariant $\sum x_i^2$.
(Note that in this example, the determination of the orbit is essentially equivalent
to the determination of the maximal invariant.) ■

Frequently, it is convenient to obtain a maximal invariant in a number of 
steps, each corresponding to a subgroup of G. To illustrate the process and a 
difficulty that may arise in its application, let $x = (x_1, \ldots, x_n)$, suppose that the 
coordinates are distinct, and consider the group of transformations 

$gx = (ax_1 + b, \ldots, ax_n + b)$, $a \neq 0$, $-\infty < b < \infty$. 

Applying first the subgroup of translations $x_i' = x_i + b$, a maximal invariant is 
$y = (y_1, \ldots, y_{n-1})$ with $y_i = x_i - x_n$. Another subgroup consists of the scale 
changes $x_i'' = ax_i$. This induces a corresponding change of scale in the y's: 
$y_i'' = ay_i$, and a maximal invariant with respect to this group acting on the y-space is 
$z = (z_1, \ldots, z_{n-2})$ with $z_i = y_i/y_{n-1}$. Expressing this in terms of the x's, we get 
$z_i = (x_i - x_n)/(x_{n-1} - x_n)$, which is maximal invariant with respect to G. 

Suppose now the process is carried out in the reverse order. Application first 
of the subgroup $x_i'' = ax_i$ yields as maximal invariant $u = (u_1, \ldots, u_{n-1})$ with 
$u_i = x_i/x_n$. However, the translations $x_i' = x_i + b$ do not induce transformations 
in u-space, since $(x_i + b)/(x_n + b)$ is not a function of $x_i/x_n$. 
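
As a numerical illustration (not part of the original text), the following Python sketch checks both halves of this discussion on a hypothetical point x: the statistic z obtained in steps is unchanged by an affine map, whereas two points sharing the same u no longer share it after a common translation. The function names and the sample values are assumptions made for the example only.

```python
def z_invariant(x):
    """z_i = (x_i - x_n) / (x_{n-1} - x_n), i = 1, ..., n-2."""
    xn, xn1 = x[-1], x[-2]
    return [(xi - xn) / (xn1 - xn) for xi in x[:-2]]

def u_invariant(x):
    """u_i = x_i / x_n, i = 1, ..., n-1 (maximal invariant under scale only)."""
    return [xi / x[-1] for xi in x[:-1]]

def affine(x, a, b):
    """Apply g x = (a x_1 + b, ..., a x_n + b) with a != 0."""
    return [a * xi + b for xi in x]

def close(p, q, tol=1e-12):
    return all(abs(s - t) < tol for s, t in zip(p, q))

x = [1.0, 2.5, 4.0, 7.0]          # hypothetical point with distinct coordinates
gx = affine(x, a=-3.0, b=2.0)

# z is invariant under the full group G.
z_same = close(z_invariant(x), z_invariant(gx))

# x and 2x have the same u, but x + 1 and 2x + 1 do not:
# translations do not induce a well-defined map on u-space.
x2 = affine(x, a=2.0, b=0.0)
same_u_before = close(u_invariant(x), u_invariant(x2))
same_u_after = close(u_invariant(affine(x, 1.0, 1.0)),
                     u_invariant(affine(x2, 1.0, 1.0)))
```

Here `z_same` and `same_u_before` hold while `same_u_after` fails, which is exactly the obstruction that condition (6.8) of the following theorem rules out.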

Quite generally, let a transformation group G be generated by two subgroups 
D and E in the sense that it is the smallest group containing D and E. Then G 
consists of the totality of products $e_m d_m \cdots e_1 d_1$ for $m = 1, 2, \ldots$, with $d_i \in D$, 
$e_i \in E$ ($i = 1, \ldots, m$).² The following theorem shows that whenever the process 
of determining a maximal invariant in steps can be carried out at all, it leads to 
a maximal invariant with respect to G. 

Theorem 6.2.2 Let G be a group of transformations, and let D and E be two 
subgroups generating G. Suppose that $y = s(x)$ is maximal invariant with respect 
to D, and that for any $e \in E$ 

$s(x_1) = s(x_2)$ implies $s(ex_1) = s(ex_2)$. (6.8) 

If $z = t(y)$ is maximal invariant under the group $E^*$ of transformations $e^*$ defined 
by 

$e^* y = s(ex)$ when $y = s(x)$, 

then $z = t[s(x)]$ is maximal invariant with respect to G. 

² See Section A.1 of the Appendix. 

Proof. To show that $t[s(x)]$ is invariant, let $x' = gx$, $g = e_m d_m \cdots e_1 d_1$. Then 

$t[s(x')] = t[s(e_m d_m \cdots e_1 d_1 x)] = t[e_m^* s(d_m e_{m-1} d_{m-1} \cdots e_1 d_1 x)]$ 
$= t[s(e_{m-1} d_{m-1} \cdots e_1 d_1 x)]$, 

and the last expression can be reduced by induction to $t[s(x)]$. To see that $t[s(x)]$ 
is in fact maximal invariant, suppose that $t[s(x')] = t[s(x)]$. Setting $y' = s(x')$, 
$y = s(x)$, one has $t(y') = t(y)$, and since $t(y)$ is maximal invariant with respect 
to $E^*$, there exists $e^*$ such that $y' = e^* y$. Then $s(x') = e^* s(x) = s(ex)$, and by 
the maximal invariance of $s(x)$ with respect to D there exists $d \in D$ such that 
$x' = dex$. Since de is an element of G this completes the proof. ■ 

Techniques for obtaining the distribution of maximal invariants are discussed 
by Andersson (1982), Eaton (1983, 1989), Farrell (1985), Wijsman (1990) and 
Anderson (2003). 


6.3 Most Powerful Invariant Tests 

In the presence of symmetries, one may wish to restrict attention to invariant 
tests, and it then becomes of interest to determine the most powerful invariant 
test. The following is a simple example. 

Example 6.3.1 Let $X_1, \ldots, X_n$ be i.i.d. on (0,1) and consider testing the 
hypothesis $H_0$ that the common distribution of the X's is uniform on (0,1) 
against the two alternatives $H_1$: 

$p_1(x_1, \ldots, x_n) = f(x_1) \cdots f(x_n)$ 

and 

$p_2(x_1, \ldots, x_n) = f(1 - x_1) \cdots f(1 - x_n)$, 

where f is a fixed (known) density. 

(i) This problem remains invariant under the two-element group G consisting of the 
transformation 

$g : x_i' = 1 - x_i$, $i = 1, \ldots, n$ 

and the identity transformation $x_i' = x_i$ for $i = 1, \ldots, n$. 

(ii) The induced transformation $\bar{g}$ in the space of alternatives takes $p_1$ into $p_2$ and 
$p_2$ into $p_1$. 

(iii) A test $\phi(x_1, \ldots, x_n)$ remains invariant under G if and only if 

$\phi(x_1, \ldots, x_n) = \phi(1 - x_1, \ldots, 1 - x_n)$. 

(iv) There exists a UMP invariant test (i.e. an invariant test which is 
simultaneously most powerful against both $p_1$ and $p_2$), and it rejects $H_0$ when the 
average 

$\bar{p}(x_1, \ldots, x_n) = \frac{1}{2}[p_1(x_1, \ldots, x_n) + p_2(x_1, \ldots, x_n)]$ 

is sufficiently large. 
We leave the proof of (i)-(iii) to Problem 6.5. To prove (iv), note that any 
invariant test satisfies 

$E_{p_1}[\phi(X_1, \ldots, X_n)] = E_{p_2}[\phi(X_1, \ldots, X_n)] = E_{\bar{p}}[\phi(X_1, \ldots, X_n)]$. 

Therefore, maximizing the power against $p_1$ or $p_2$ is equivalent to maximizing 
the power under $\bar{p}$, and the result follows from the Neyman-Pearson Lemma. ■ 
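
To make the example concrete, the following Python sketch evaluates the averaged density for one particular choice of f. The density f(x) = 2x on (0,1) and the sample point are assumptions chosen only for illustration; the text leaves f unspecified. The sketch confirms that the reflection $x_i \to 1 - x_i$ interchanges $p_1$ and $p_2$ and leaves their average unchanged, so a test based on the average is invariant.

```python
from math import prod

def f(x):
    """Illustrative density on (0, 1): f(x) = 2x (an assumption, not from the text)."""
    return 2.0 * x

def p1(xs):
    return prod(f(x) for x in xs)

def p2(xs):
    return prod(f(1.0 - x) for x in xs)

def p_bar(xs):
    """Average of p1 and p2; under H0 the density is 1, so this is also the likelihood ratio."""
    return 0.5 * (p1(xs) + p2(xs))

def reflect(xs):
    """The non-identity element of G: x_i -> 1 - x_i."""
    return [1.0 - x for x in xs]

xs = [0.9, 0.8, 0.75]   # hypothetical sample point
swap_ok = abs(p1(reflect(xs)) - p2(xs)) < 1e-12
invariant_ok = abs(p_bar(reflect(xs)) - p_bar(xs)) < 1e-12
```

The UMP invariant test then rejects for large values of `p_bar(xs)`, with the cutoff fixed by the level under the uniform distribution.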

This example is a special case of the following result. 


Theorem 6.3.1 Suppose the problem of testing $H_0$ against $H_1$ remains invariant 
under a finite group $G = \{g_1, \ldots, g_N\}$ and that $\bar{G}$ is transitive over $H_0$ and over 
$H_1$. Then there exists a UMP invariant test of $H_0$ against $H_1$, and it rejects $H_0$ 
when 

$$\frac{\sum_{i=1}^{N} p_{\bar{g}_i \theta_1}(x)/N}{\sum_{i=1}^{N} p_{\bar{g}_i \theta_0}(x)/N}$$ (6.9) 

is sufficiently large, where $\theta_0$ and $\theta_1$ are any elements of $H_0$ and $H_1$, respectively. 

The proof is exactly analogous to that of the preceding example; see Problem 6.6. 

The results of the previous section provide an alternative approach to the 
determination of most powerful invariant tests. By Theorem 6.2.1, the class of 
all invariant functions can be obtained as the totality of functions of a maximal 
invariant M(x). Therefore, in particular the class of all invariant tests is the 
totality of tests depending only on the maximal invariant statistic M. The latter 
statement, while correct for all the usual situations, actually requires certain 
qualifications regarding the class of measurable sets in M-space. These conditions 
will be discussed at the end of the section; they are satisfied in the examples below. 


Example 6.3.2 Let $X = (X_1, \ldots, X_n)$, and suppose that the density of X 
is $f_i(x_1 - \theta, \ldots, x_n - \theta)$ under $H_i$ ($i = 0, 1$), where $\theta$ ranges from $-\infty$ to 
$\infty$. The problem of testing $H_0$ against $H_1$ is invariant under the group G of 
transformations 

$gx = (x_1 + c, \ldots, x_n + c)$, $-\infty < c < \infty$, 

which in the parameter space induces the transformations 

$\bar{g}\theta = \theta + c$. 

By Example 6.2.1, a maximal invariant under G is $Y = (X_1 - X_n, \ldots, X_{n-1} - X_n)$. 
The distribution of Y is independent of $\theta$ and under $H_i$ has the density 

$$\int_{-\infty}^{\infty} f_i(y_1 + z, \ldots, y_{n-1} + z, z)\, dz.$$ 

When referred to Y, the problem of testing $H_0$ against $H_1$ therefore becomes one 
of testing a simple hypothesis against a simple alternative. The most powerful 
test is then independent of $\theta$, and therefore UMP among all invariant tests. Its 
rejection region by the Neyman-Pearson lemma is 

$$\frac{\int_{-\infty}^{\infty} f_1(y_1 + z, \ldots, y_{n-1} + z, z)\, dz}{\int_{-\infty}^{\infty} f_0(y_1 + z, \ldots, y_{n-1} + z, z)\, dz} = \frac{\int_{-\infty}^{\infty} f_1(x_1 + u, \ldots, x_n + u)\, du}{\int_{-\infty}^{\infty} f_0(x_1 + u, \ldots, x_n + u)\, du} > C.$$ (6.10) 

A general theory of separate families of hypotheses (in which the family K of 
alternatives does not adjoin the hypothesis H but, as above, is separated from 
it) was initiated by Cox (1961, 1962). A bibliography of the subject is given in 
Pereira (1977); see also Loh (1985), Pace and Salvan (1990) and Rukhin (1993). ■ 

Example 6.3.2 illustrates the fact, also utilized in Theorem 6.3.1, that if the 
group $\bar{G}$ is transitive over both $\Omega_0$ and $\Omega_1$, then the problem reduces to one of 
testing a simple hypothesis against a simple alternative, and a UMP invariant test 
is then obtained by the Neyman-Pearson Lemma. Note also the close similarity 
between Theorem 6.3.1 and Example 6.3.2 shown by a comparison of (6.9) and 
the right side of (6.10), where the summation in (6.9) is replaced by integration 
with respect to Lebesgue measure. 
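
The statistic of Example 6.3.2 (the ratio of the two integrals over all translates of the sample) can be approximated numerically. The Python sketch below is only an illustration under stated assumptions: $f_0$ is taken to be a product of standard normal densities and $f_1$ a product of standard Laplace densities, the integrals are approximated by a midpoint rule on a fixed finite grid, and the sample values are hypothetical. It then checks that the ratio is unchanged, up to quadrature error, when a constant is added to every observation.

```python
import math

def f0(xs):
    """Joint density of i.i.d. N(0, 1) coordinates (illustrative choice of f_0)."""
    return math.prod(math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi) for x in xs)

def f1(xs):
    """Joint density of i.i.d. standard Laplace coordinates (illustrative choice of f_1)."""
    return math.prod(0.5 * math.exp(-abs(x)) for x in xs)

def integrate_over_shifts(f, xs, lo=-60.0, hi=60.0, steps=24000):
    """Midpoint-rule approximation of the integral of f(x_1 + u, ..., x_n + u) du."""
    h = (hi - lo) / steps
    return h * sum(f([x + lo + (k + 0.5) * h for x in xs]) for k in range(steps))

def invariant_ratio(xs):
    """Approximate ratio of the f_1 integral to the f_0 integral over all translates."""
    return integrate_over_shifts(f1, xs) / integrate_over_shifts(f0, xs)

x = [0.3, -1.1, 2.0]                  # hypothetical observations
x_shifted = [xi + 1.2345 for xi in x]  # translation by c = 1.2345

r, r_shifted = invariant_ratio(x), invariant_ratio(x_shifted)
relative_gap = abs(r - r_shifted) / r
```

The quantity `relative_gap` is of the order of the quadrature error, reflecting the exact translation invariance of the ratio being approximated.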

Before applying invariance, it is frequently convenient first to reduce the data to 
a sufficient statistic T. If there exists a test $\phi_0(T)$ that is UMP among all invariant 
tests depending only on T, one would like to be able to conclude that $\phi_0(T)$ is 
also UMP among all invariant tests based on the original X. Unfortunately, this 
does not follow, since it is not clear that for any invariant test based on X there 
exists an equivalent test based on T, which is also invariant. Sufficient conditions 
for $\phi_0(T)$ to have this property are provided by Hall, Wijsman, and Ghosh (1965) 
and Hooper (1982a), and a simple version of such a result (applicable to Examples 
6.3.3 and 6.3.4 below) will be given by Theorem 6.5.3 in Section 6.5. For a review 
and clarification of this and later work on invariance and sufficiency see Berk, 
Nogales, and Oyola (1996), Nogales and Oyola (1996) and Nogales, Oyola and 
Perez (2000). 

Example 6.3.3 If $X_1, \ldots, X_n$ is a sample from $N(\xi, \sigma^2)$, the hypothesis $H : 
\sigma \geq \sigma_0$ remains invariant under the transformations $X_i' = X_i + c$, $-\infty < c < \infty$. 
In terms of the sufficient statistics $Y = \bar{X}$, $S^2 = \sum(X_i - \bar{X})^2$ these 
transformations become $Y' = Y + c$, $(S^2)' = S^2$, and a maximal invariant is $S^2$. The 
class of invariant tests is therefore the class of tests depending on $S^2$. It follows 
from Theorem 3.4.1 that there exists a UMP invariant test, with rejection region 
$\sum(X_i - \bar{X})^2 \leq C$. This coincides with the UMP unbiased test (6.11). ■ 


Example 6.3.4 If $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are samples from $N(\xi, \sigma^2)$ and 
$N(\eta, \tau^2)$, a set of sufficient statistics is $T_1 = \bar{X}$, $T_2 = \bar{Y}$, $T_3 = \sqrt{\sum(X_i - \bar{X})^2}$, and 
$T_4 = \sqrt{\sum(Y_j - \bar{Y})^2}$. The problem of testing $H : \tau^2/\sigma^2 \leq \Delta_0$ remains invariant 
under the transformations $T_1' = T_1 + c_1$, $T_2' = T_2 + c_2$, $T_3' = T_3$, $T_4' = T_4$, 
$-\infty < c_1, c_2 < \infty$, and also under a common change of scale of all four variables. 
A maximal invariant with respect to the first group is $(T_3, T_4)$. In the space of 
this maximal invariant, the group of scale changes induces the transformations 
$T_3'' = cT_3$, $T_4'' = cT_4$, $0 < c$, which has as maximal invariant the ratio $T_4/T_3$. 
The statistic $Z = [T_4^2/(n - 1)] \div [T_3^2/(m - 1)]$ on division by $\Delta = \tau^2/\sigma^2$ has an 
F-distribution with density given by (5.21), so that the density of Z is 

$$\frac{c(\Delta)\, z^{\frac{1}{2}(n-3)}}{\left(\Delta + \frac{n-1}{m-1}\, z\right)^{\frac{1}{2}(m+n-2)}}, \qquad z > 0.$$ 

For varying $\Delta$, these densities constitute a family with monotone likelihood ratio, 
so that among all tests of H based on Z, and therefore among all invariant tests, 
there exists a UMP one given by the rejection region $Z > C$. This coincides with 
the UMP unbiased test (5.20). ■ 
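
The two-step reduction of Example 6.3.4 can be checked on numbers. The following Python sketch, with hypothetical data and a hypothetical function name, computes Z from two samples and verifies that it is unchanged when the samples are shifted by separate constants and then multiplied by a common positive scale factor.

```python
def variance_ratio(x, y):
    """Z = [sum (y_j - ybar)^2 / (n - 1)] / [sum (x_i - xbar)^2 / (m - 1)]."""
    m, n = len(x), len(y)
    xbar, ybar = sum(x) / m, sum(y) / n
    t3_sq = sum((xi - xbar) ** 2 for xi in x)   # T_3 squared
    t4_sq = sum((yj - ybar) ** 2 for yj in y)   # T_4 squared
    return (t4_sq / (n - 1)) / (t3_sq / (m - 1))

x = [1.2, 0.7, 2.9, 1.8]          # hypothetical X sample
y = [5.0, 7.5, 3.1, 6.2, 4.4]     # hypothetical Y sample
z = variance_ratio(x, y)

# Separate location shifts c1, c2 followed by a common scale change c > 0.
c1, c2, c = 10.0, -4.0, 2.5
z_transformed = variance_ratio([c * (xi + c1) for xi in x],
                               [c * (yj + c2) for yj in y])
```

The cutoff C in the rejection region Z > C would come from the F-distribution with n - 1 and m - 1 degrees of freedom scaled by $\Delta_0$; it is not computed in this sketch.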


Example 6.3.5 In the method of paired comparisons for testing whether a 
treatment has a beneficial effect, the experimental material consists of n pairs of 
subjects. From each pair, a subject is selected at random for treatment while the 
other serves as control. Let $X_i$ be 1 or 0 as for the ith pair the experiment turns 
out in favor of the treated subject or the control, and let $p_i = P\{X_i = 1\}$. The 
hypothesis of no effect, $H : p_i = \frac{1}{2}$ for $i = 1, \ldots, n$, is to be tested against the 
alternatives that $p_i > \frac{1}{2}$ for all i. 

The problem remains invariant under all permutations of the n variables 
$X_1, \ldots, X_n$, and a maximal invariant under this group is the total number of 
successes $X = X_1 + \cdots + X_n$. The distribution of X is 

$$P\{X = k\} = q_1 \cdots q_n \sum \frac{p_{i_1}}{q_{i_1}} \cdots \frac{p_{i_k}}{q_{i_k}},$$ 

where $q_i = 1 - p_i$ and where the summation extends over all $\binom{n}{k}$ choices of 
subscripts $i_1 < \cdots < i_k$. The most powerful invariant test against an alternative 
$(p_1', \ldots, p_n')$ rejects H when 

$$f(k) = \frac{1}{\binom{n}{k}} \sum \frac{p'_{i_1}}{q'_{i_1}} \cdots \frac{p'_{i_k}}{q'_{i_k}} > C.$$ 

To see that f is an increasing function of k, note that $a_i = p_i'/q_i' > 1$, and that 

$$\sum_j \sum a_j a_{i_1} \cdots a_{i_k} = (k + 1) \sum a_{i_1} \cdots a_{i_{k+1}}$$ 

and 

$$\sum_j \sum a_{i_1} \cdots a_{i_k} = (n - k) \sum a_{i_1} \cdots a_{i_k}.$$ 

Here, in both equations, the second summation on the left-hand side extends over 
all subscripts $i_1 < \cdots < i_k$ of which none is equal to j, and the summation on 
the right-hand side extends over all subscripts $i_1 < \cdots < i_{k+1}$ and $i_1 < \cdots < i_k$ 
respectively without restriction. Then 

$$f(k + 1) = \frac{1}{\binom{n}{k+1}} \sum a_{i_1} \cdots a_{i_{k+1}} = \frac{1}{(n - k)\binom{n}{k}} \sum_j \sum a_j a_{i_1} \cdots a_{i_k} > \frac{1}{\binom{n}{k}} \sum a_{i_1} \cdots a_{i_k} = f(k),$$ 

as was to be shown. Regardless of the alternative chosen, the test therefore rejects 
when $k > C$, and hence is UMP invariant. If the ith comparison is considered 
plus or minus as $X_i$ is 1 or 0, this is seen to be another example of the sign test. 
(Cf. Example 3.8.1 and Section 4.9.) ■ 
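
The monotonicity of f can be confirmed by brute force for a small n. The Python sketch below takes f(k) to be the average, over all subsets of size k, of the products of the odds $a_i = p_i'/q_i'$, which is the form used in the argument above. The alternative `p_alt` is a hypothetical choice with every $p_i' > 1/2$.

```python
from itertools import combinations
from math import comb, prod

def f(k, p_alt):
    """Average over all k-subsets of the product of the odds a_i = p'_i / q'_i."""
    a = [p / (1.0 - p) for p in p_alt]
    n = len(a)
    total = sum(prod(a[i] for i in idx) for idx in combinations(range(n), k))
    return total / comb(n, k)

p_alt = [0.55, 0.6, 0.7, 0.8, 0.9]   # hypothetical alternative, all p'_i > 1/2
values = [f(k, p_alt) for k in range(len(p_alt) + 1)]
is_increasing = all(lo < hi for lo, hi in zip(values, values[1:]))
```

Because `values` is strictly increasing whenever every odds ratio exceeds 1, rejecting for large f(k) is the same as rejecting for large k, whichever alternative of this kind is used.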


Sufficient statistics provide a simplification of a problem by reducing the 
sample space; this process involves no change in the parameter space. Invariance, 
on the other hand, by reducing the data to a maximal invariant statistic M, 
whose distribution may depend only on a function of the parameter, typically 
also shrinks the parameter space. The details are given in the following theorem. 
Theorem 6.3.2 If $M(x)$ is invariant under G, and if $v(\theta)$ is maximal invariant 
under the induced group $\bar{G}$, then the distribution of $M(X)$ depends only on $v(\theta)$. 

Proof. Let $v(\theta_1) = v(\theta_2)$. Then $\theta_2 = \bar{g}\theta_1$, and hence 

$P_{\theta_2}\{M(X) \in B\} = P_{\bar{g}\theta_1}\{M(X) \in B\} = P_{\theta_1}\{M(gX) \in B\}$ 
$= P_{\theta_1}\{M(X) \in B\}$. 

This result can be paraphrased by saying that the principle of invariance identifies 
all parameter points that are equivalent with respect to $\bar{G}$. ■ 

In application, for instance in Examples 6.3.3 and 6.3.4, the maximal invariants 
$M(x)$ and $\delta = v(\theta)$ under G and $\bar{G}$ are frequently real-valued, and the family of 
probability densities $p_\delta(m)$ of M has monotone likelihood ratio. For testing the 
hypothesis $H : \delta \leq \delta_0$ there exists then a UMP test among those depending only 
on M, and hence a UMP invariant test. Its rejection region is $M > C$, where 

$$\int_C^\infty p_{\delta_0}(m)\, dm = \alpha.$$ (6.11) 

Consider this problem now as a two-decision problem with decisions $d_0$ and $d_1$ 
of accepting or rejecting H, and a loss function $L(\theta, d_i) = L_i(\theta)$. Suppose that 
$L_i(\theta)$ depends only on the parameter $\delta$, $L_i(\theta) = L_i'(\delta)$ say, and satisfies 

$L_1'(\delta) - L_0'(\delta) \gtrless 0$ as $\delta \lessgtr \delta_0$. (6.12) 

It then follows from Theorem 3.4.2 that the family of rejection regions $M > C(\alpha)$, 
as $\alpha$ varies from 0 to 1, forms a complete family of decision procedures among 
those depending only on M, and hence a complete family of invariant procedures. 
As before, the choice of a particular significance level $\alpha$ can be considered as a 
convenient way of specifying a test from this family. 

At the beginning of the section it was stated that the class of invariant tests 
coincides with the class of tests based on a maximal invariant statistic $M = 
M(X)$. However, a statistic is not completely specified by a function, but requires 
also specification of a class $\mathcal{B}$ of measurable sets. If in the present case $\mathcal{B}$ is the 
class of all sets B for which $M^{-1}(B) \in \mathcal{A}$, the desired statement is correct. For 
let $\phi(x) = \psi[M(x)]$ and $\phi$ be $\mathcal{A}$-measurable, and let C be a Borel set on the 
line. Then $\phi^{-1}(C) = M^{-1}[\psi^{-1}(C)] \in \mathcal{A}$ and hence $\psi^{-1}(C) \in \mathcal{B}$, so that $\psi$ is 
$\mathcal{B}$-measurable and $\phi(x) = \psi[M(x)]$ is a test based on the statistic M. 

In most applications, $M(x)$ is a measurable function taking on values in a 
Euclidean space and it is convenient to take $\mathcal{B}$ as the class of Borel sets. If $\phi(x) = 
\psi[M(x)]$ is then an arbitrary measurable function depending only on $M(x)$, it 
is not clear that $\psi(m)$ is necessarily $\mathcal{B}$-measurable. This measurability can be 
concluded if $\mathcal{X}$ is also Euclidean with $\mathcal{A}$ the class of Borel sets, and if the range 
of M is a Borel set. We shall prove it here only under the additional assumption 
(which in applications is usually obvious, and which will not be verified explicitly 
in each case) that there exists a vector-valued Borel-measurable function $Y(x)$ 
such that $[M(x), Y(x)]$ maps $\mathcal{X}$ onto a Borel subset of the product space $\mathcal{M} \times \mathcal{Y}$, 
that this mapping is 1 : 1, and that the inverse mapping is also Borel-measurable. 
Given any measurable function $\phi$ of x, there exists then a measurable function 
$\phi'$ of $(m, y)$ such that $\phi(x) = \phi'[M(x), Y(x)]$. If $\phi$ depends only on $M(x)$, then 
$\phi'$ depends only on m, so that $\phi'(m, y) = \psi(m)$ say, and $\psi$ is a measurable 
function of m.³ In Example 6.2.1(i) for instance, where $x = (x_1, \ldots, x_n)$ and 
$M(x) = (x_1 - x_n, \ldots, x_{n-1} - x_n)$, the function $Y(x)$ can be taken as $Y(x) = x_n$. 


6.4 Sample Inspection by Variables 

A sample is drawn from a lot of some manufactured product in order to decide 
whether the lot is of acceptable quality. In the simplest case, each sample item is 
classified directly as satisfactory or defective (inspection by attributes), and the 
decision is based on the total number of defectives. More generally, the quality 
of an item is characterized by a variable Y (inspection by variables), and an item 
is considered satisfactory if Y exceeds a given constant u. The probability of a 
defective is then 


$p = P\{Y \leq u\}$ 

and the problem becomes that of testing the hypothesis $H : p \geq p_0$. 

As was seen in Example 3.8.1, no use can be made of the actual value of Y 
unless something is known concerning the distribution of Y. In the absence of 
such information, the decision will be based, as before, simply on the number of 
defectives in the sample. We shall consider the problem now under the assumption 
that the measurements $Y_1, \ldots, Y_n$ constitute a sample from $N(\eta, \sigma^2)$. Then 

$$p = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(y - \eta)^2\right] dy = \Phi\left(\frac{u - \eta}{\sigma}\right),$$ 

where 

$$\Phi(y) = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}t^2\right) dt$$ 

denotes the cumulative distribution function of a standard normal distribution, 
and the hypothesis H becomes $(u - \eta)/\sigma \geq \Phi^{-1}(p_0)$. In terms of the variables 
$X_i = Y_i - u$, which have mean $\xi = \eta - u$ and variance $\sigma^2$, this reduces to 

$$H : \frac{\xi}{\sigma} \leq \theta_0$$ 

with $\theta_0 = -\Phi^{-1}(p_0)$. This hypothesis, which was considered in Section 5.2, for 
$\theta_0 = 0$, occurs also in other contexts. It is appropriate when one is interested in 
the mean $\xi$ of a normal distribution, expressed in $\sigma$ units rather than on a fixed 
scale. 

³ The last statement follows, for example, from Theorem 18.1 of Billingsley (1995). 

For testing H, attention can be restricted to the pair of variables $\bar{X}$ and 
$S = \sqrt{\sum(X_i - \bar{X})^2}$, since they form a set of sufficient statistics for $(\xi, \sigma)$, which 
satisfy the conditions of Theorem 6.5.3 of the next section. These variables are 
independent, the distribution of $\bar{X}$ being $N(\xi, \sigma^2/n)$ and that of $S/\sigma$ being $\chi_{n-1}$. 
Multiplication of $\bar{X}$ and S by a common constant $c > 0$ transforms the 
parameters into $\xi' = c\xi$, $\sigma' = c\sigma$, so that $\xi/\sigma$ and hence the problem of testing H remain 
invariant. A maximal invariant under these transformations is $\bar{x}/s$ or 

$$t = \frac{\sqrt{n}\,\bar{x}}{s/\sqrt{n - 1}},$$ 

the distribution of which depends only on the maximal invariant in the parameter 
space $\theta = \xi/\sigma$ (cf. Section 5.2). Thus, the invariant tests are those depending only 
on t, and it remains to find the most powerful test of $H : \theta \leq \theta_0$ within this class. 
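The probability density of t is (Problem 5.3) 

$$p_\delta(t) = c \int_0^\infty \exp\left[-\frac{1}{2}\left(t\sqrt{\frac{w}{n-1}} - \delta\right)^2\right] w^{\frac{1}{2}(n-2)} \exp\left(-\frac{1}{2}w\right) dw,$$ 

where $\delta = \sqrt{n}\,\theta$ is the noncentrality parameter, and this will now be shown to 
constitute a family with monotone likelihood ratio. To see that the ratio 

$$r(t) = \frac{\int_0^\infty \exp\left[-\frac{1}{2}\left(t\sqrt{\frac{w}{n-1}} - \delta_1\right)^2\right] w^{\frac{1}{2}(n-2)} \exp\left(-\frac{1}{2}w\right) dw}{\int_0^\infty \exp\left[-\frac{1}{2}\left(t\sqrt{\frac{w}{n-1}} - \delta_0\right)^2\right] w^{\frac{1}{2}(n-2)} \exp\left(-\frac{1}{2}w\right) dw}$$ 

is an increasing function of t for $\delta_0 < \delta_1$, suppose first that $t < 0$ and let $v = 
-t\sqrt{w/(n - 1)}$. The ratio then becomes proportional to 

$$\frac{\int_0^\infty f(v) \exp\left[-(\delta_1 - \delta_0)v - \frac{(n-1)v^2}{2t^2}\right] dv}{\int_0^\infty f(v) \exp\left[-\frac{(n-1)v^2}{2t^2}\right] dv} = \int_0^\infty \exp[-(\delta_1 - \delta_0)v]\, g_{t^2}(v)\, dv,$$ 

where 

$$f(v) = \exp(-\delta_0 v)\, v^{n-1} \exp(-v^2/2)$$ 

and 

$$g_{t^2}(v) = \frac{f(v) \exp\left[-\frac{(n-1)v^2}{2t^2}\right]}{\int_0^\infty f(z) \exp\left[-\frac{(n-1)z^2}{2t^2}\right] dz}.$$ 

Since the family of probability densities $g_{t^2}(v)$ is a family with monotone 
likelihood ratio, the integral of $\exp[-(\delta_1 - \delta_0)v]$ with respect to this density is a 
decreasing function of $t^2$ (Problem 3.39), and hence an increasing function of t 
for $t < 0$. Similarly one finds that $r(t)$ is an increasing function of t for $t > 0$ 
by making the transformation $v = t\sqrt{w/(n - 1)}$. By continuity it is then an 
increasing function of t for all t. 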

There exists therefore a UMP invariant test of $H : \xi/\sigma \leq \theta_0$, which rejects 
when $t > C$, where C is determined by (6.11). In terms of the original variables 
$Y_i$ the rejection region of the UMP invariant test of $H : p \geq p_0$ becomes 

$$\frac{\sqrt{n}\,(\bar{y} - u)}{\sqrt{\sum(y_i - \bar{y})^2/(n - 1)}} > C.$$ (6.13) 

If the problem is considered as a two-decision problem with losses $L_0(p)$ and 
$L_1(p)$ for accepting or rejecting $p \geq p_0$, which depend only on p and satisfy the 
condition corresponding to (6.12), the class of tests (6.13) constitutes a complete 
family of invariant procedures as C varies from $-\infty$ to $\infty$. 
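
A short Python sketch of the statistic on the left side of (6.13) follows. The measurements, the specification limit u, and the function name are hypothetical. The sketch also checks the invariance that motivates the statistic: multiplying all deviations $X_i = Y_i - u$ by a common positive constant leaves t unchanged. The cutoff C requires a quantile of the noncentral t-distribution and is deliberately not computed here.

```python
import math

def inspection_statistic(y, u):
    """sqrt(n) (ybar - u) / sqrt(sum (y_i - ybar)^2 / (n - 1)), as in (6.13)."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    return math.sqrt(n) * (ybar - u) / math.sqrt(s2)

y = [10.2, 9.8, 10.5, 10.1, 9.9]   # hypothetical measurements
u = 9.5                             # hypothetical specification limit
t = inspection_statistic(y, u)

# Scale change X_i -> c X_i with X_i = Y_i - u, i.e. Y_i -> u + c (Y_i - u).
c = 3.7
y_scaled = [u + c * (yi - u) for yi in y]
t_scaled = inspection_statistic(y_scaled, u)
```

The lot is accepted as satisfying p < $p_0$ only when `t` exceeds the cutoff, so large positive values of t count against the hypothesis $p \geq p_0$.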

Consider next the comparison of two products on the basis of samples 
$X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ from $N(\xi, \sigma^2)$ and $N(\eta, \sigma^2)$. If 

$$p = \Phi\left(\frac{u - \xi}{\sigma}\right), \qquad \pi = \Phi\left(\frac{u - \eta}{\sigma}\right),$$ 

one wishes to test the hypothesis $p \leq \pi$, which is equivalent to 

$H : \eta \leq \xi$. 

The statistics $\bar{X}$, $\bar{Y}$, and $S = \sqrt{\sum(X_i - \bar{X})^2 + \sum(Y_j - \bar{Y})^2}$ are a set of sufficient 
statistics for $\xi$, $\eta$, $\sigma$. The problem remains invariant under the addition of an 
arbitrary common constant to $\bar{X}$ and $\bar{Y}$, which leaves $\bar{Y} - \bar{X}$ and S as maximal 
invariants. It is also invariant under multiplication of $\bar{X}$, $\bar{Y}$, and S, and hence of 
$\bar{Y} - \bar{X}$ and S, by a common positive constant, which reduces the data to the 
maximal invariant $(\bar{Y} - \bar{X})/S$. Since 

$$t = \frac{(\bar{y} - \bar{x})\big/\sqrt{\frac{1}{m} + \frac{1}{n}}}{s/\sqrt{m + n - 2}}$$ 

has a noncentral t-distribution with noncentrality parameter $\delta = \sqrt{mn}\,(\eta - \xi)/ 
(\sqrt{m + n}\,\sigma)$, the UMP invariant test of $H : \eta - \xi \leq 0$ rejects when $t > C$. This 
coincides with the UMP unbiased test (5.27). Analogously, the corresponding 
two-sided test (5.30), with rejection region $|t| > C$, is UMP invariant for testing 
the hypothesis $p = \pi$ against the alternatives $p \neq \pi$ (Problem 6.18). 


6.5 Almost Invariance 

Let G be a group of transformations leaving a family $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ of 
distributions of X invariant. A test $\phi$ is said to be equivalent to an invariant test if 
there exists an invariant test $\psi$ such that $\phi(x) = \psi(x)$ for all x except possibly 
on a $\mathcal{P}$-null set N; $\phi$ is said to be almost invariant with respect to G if 

$\phi(gx) = \phi(x)$ for all $x \in \mathcal{X} - N_g$, $g \in G$ (6.14) 

where the exceptional null set $N_g$ is permitted to depend on g. This concept 
is required for investigating the relationship of invariance to unbiasedness and 
to certain other desirable properties. In this connection it is important to know 
whether a UMP invariant test is also UMP among almost invariant tests. This 
turns out to be the case under assumptions which are made precise in Theorem 
6.5.1 below and which are satisfied in all the usual applications. 

If $\phi$ is equivalent to an invariant test, then $\phi(gx) = \phi(x)$ for all $x \notin N \cup g^{-1}N$. 
Since $P_\theta(g^{-1}N) = P_{\bar{g}\theta}(N) = 0$, it follows that $\phi$ is then almost invariant. The 
following theorem gives conditions under which conversely any almost invariant 
test is equivalent to an invariant one. 

Theorem 6.5.1 Let G be a group of transformations of $\mathcal{X}$, and let $\mathcal{A}$ and $\mathcal{B}$ be 
$\sigma$-fields of subsets of $\mathcal{X}$ and G such that for any set $A \in \mathcal{A}$ the set of pairs $(x, g)$ 
for which $gx \in A$ is measurable $\mathcal{A} \times \mathcal{B}$. Suppose further that there exists a $\sigma$-finite 
measure $\nu$ over G such that $\nu(B) = 0$ implies $\nu(Bg) = 0$ for all $g \in G$. Then any 
measurable function that is almost invariant under G (where "almost" refers to 
some $\sigma$-finite measure $\mu$) is equivalent to an invariant function. 


Proof. Because of the measurability assumptions, the function $\phi(gx)$ considered 
as a function of the two variables x and g is measurable $\mathcal{A} \times \mathcal{B}$. It follows that 
$\phi(gx) - \phi(x)$ is measurable $\mathcal{A} \times \mathcal{B}$, and so therefore is the set S of points $(x, g)$ 
with $\phi(gx) \neq \phi(x)$. If $\phi$ is almost invariant, any section of S with fixed g is a 
$\mu$-null set. By Fubini's theorem (Theorem 2.2.4), there exists therefore a $\mu$-null 
set N such that for all $x \in \mathcal{X} - N$ 

$\phi(gx) = \phi(x)$ a.e. $\nu$. 

Without loss of generality suppose that $\nu(G) = 1$, and let A be the set of points 
x for which 

$$\int \phi(g'x)\, d\nu(g') = \phi(gx) \quad \text{a.e. } \nu.$$ 

If 

$$f(x, g) = \left| \int \phi(g'x)\, d\nu(g') - \phi(gx) \right|,$$ 

then A is the set of points x for which 

$$\int f(x, g)\, d\nu(g) = 0.$$ 

Since this integral is a measurable function of x, it follows that A is measurable. 
Let 

$$\psi(x) = \begin{cases} \int \phi(gx)\, d\nu(g) & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}$$ 

Then $\psi$ is measurable and $\psi(x) = \phi(x)$ for $x \notin N$, since $\phi(gx) = \phi(x)$ a.e. $\nu$ 
implies that $\int \phi(g'x)\, d\nu(g') = \phi(x)$ and that $x \in A$. To show that $\psi$ is invariant 
it is enough to prove that the set A is invariant. For any point $x \in A$, the function 
$\phi(gx)$ is constant except on a null subset $N_x$ of G. Then $\phi(ghx)$ has the same 
constant value for all $g \notin N_x h^{-1}$, which by assumption is again a $\nu$-null set; and 
hence $hx \in A$, which completes the proof. ■ 

Additional results concerning the relation of invariance and almost invariance 
are given by Berk and Bickel (1968) and Berk (1970). In particular, the basic 
idea of the following example is due to Berk (1970). 


Example 6.5.1 (Counterexample) Let $Z, Y_1, \ldots, Y_n$ be independently 
distributed as N(0, 1), and consider the 1 : 1 transformations $y_i' = y_i$ ($i = 1, \ldots, n$) 
and 

$z' = z$ except for a finite number of points $a_1, \ldots, a_k$ for which 
$a_i' = a_{j_i}$ for some permutation $(j_1, \ldots, j_k)$ of $(1, \ldots, k)$. 

If the group G is generated by taking for $(a_1, \ldots, a_k)$, $k = 1, 2, \ldots$, all finite sets 
and for $(j_1, \ldots, j_k)$ all permutations of $(1, \ldots, k)$, then $(z, y_1, \ldots, y_n)$ is almost 
invariant. It is however not equivalent to an invariant function, since $(y_1, \ldots, y_n)$ 
is maximal invariant. ■ 
Corollary 6.5.1 Suppose that the problem of testing $H : \theta \in \omega$ against $K : \theta \in 
\Omega - \omega$ remains invariant under G and that the assumptions of Theorem 6.5.1 
hold. Then if $\phi_0$ is UMP invariant, it is also UMP within the class of almost 
invariant tests. 

Proof. If $\phi$ is almost invariant, it is equivalent to an invariant test $\psi$ by Theorem 
6.5.1. The tests $\phi$ and $\psi$ have the same power function, and hence $\phi_0$ is uniformly 
at least as powerful as $\phi$. ■ 

In applications, $\mathcal{P}$ is usually a dominated family, and $\mu$ any $\sigma$-finite measure 
equivalent to $\mathcal{P}$ (which exists by Theorem A.4.2 of the Appendix). If $\phi$ is almost 
invariant with respect to $\mathcal{P}$, it is then almost invariant with respect to $\mu$ and 
hence equivalent to an invariant test. Typically, the sample space $\mathcal{X}$ is an 
n-dimensional Euclidean space, $\mathcal{A}$ is the class of Borel sets, and the elements of G 
are transformations of the form $y = f(x, \tau)$, where $\tau$ ranges over a set of positive 
measure in an m-dimensional space and f is a Borel-measurable vector-valued 
function of m + n variables. If $\mathcal{B}$ is taken as the class of Borel sets in m-space the 
measurability conditions of the theorem are satisfied. 

The requirement that for all $g \in G$ and $B \in \mathcal{B}$ 

$\nu(B) = 0$ implies $\nu(Bg) = 0$ (6.15) 

is satisfied in particular when 

$\nu(Bg) = \nu(B)$ for all $g \in G$, $B \in \mathcal{B}$. (6.16) 

The existence of such a right invariant measure is guaranteed for a large class 
of groups by the theory of Haar measure. (See, for example, Eaton (1989).) 
Alternatively, it is usually not difficult to check the condition (6.15) directly. 


Example 6.5.2 Let G be the group of all nonsingular linear transformations of 
n-space. Relative to a fixed coordinate system the elements of G can be 
represented by nonsingular $n \times n$ matrices $A = (a_{ij})$, $A' = (a'_{ij})$, ... with the matrix 
product serving as the group product of two such elements. The $\sigma$-field $\mathcal{B}$ can be 
taken to be the class of Borel sets in the space of the $n^2$ elements of the matrices, 
and the measure $\nu$ can be taken as Lebesgue measure over $\mathcal{B}$. Consider now a set 
S of matrices with $\nu(S) = 0$, and the set $S^*$ of matrices $A'A$ with $A' \in S$ and A 
fixed. If $a = \max|a_{ij}|$, $C' = A'A$, and $C'' = A''A$, the inequalities $|a''_{ij} - a'_{ij}| \leq \epsilon$ 
for all i, j imply $|c''_{ij} - c'_{ij}| \leq na\epsilon$. Since a set has $\nu$-measure zero if and only if 
it can be covered by a union of rectangles whose total measure does not exceed 
any given $\epsilon > 0$, it follows that $\nu(S^*) = 0$, as was to be proved. ■ 
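
The entrywise bound used in Example 6.5.2 can be checked numerically. In the Python sketch below, the matrices, the dimension n = 4, and the perturbation size are arbitrary choices made for illustration; the helper names are hypothetical. Two matrices that differ by at most $\epsilon$ in every entry are multiplied on the right by a fixed matrix A, and the largest entrywise difference of the products is compared with $na\epsilon$.

```python
import random

def matmul(P, Q):
    """Product of two square matrices given as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_abs_diff(P, Q):
    """Largest absolute entrywise difference between P and Q."""
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

random.seed(0)
n, eps = 4, 1e-3
A = [[random.uniform(-2, 2) for _ in range(n)] for _ in range(n)]    # fixed A
A1 = [[random.uniform(-2, 2) for _ in range(n)] for _ in range(n)]   # A'
A2 = [[v + random.uniform(-eps, eps) for v in row] for row in A1]    # A'' near A'

a = max(abs(v) for row in A for v in row)
lhs = max_abs_diff(matmul(A2, A), matmul(A1, A))   # max |c''_ij - c'_ij|
bound = n * a * eps
```

The inequality `lhs <= bound` holds because each entry of the difference of the products is a sum of n terms, each at most $a\epsilon$ in absolute value.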

In the preceding chapters, tests were compared purely in terms of their power 
functions (possibly weighted according to the seriousness of the losses involved). 
Since the restriction to invariant tests is a departure from this point of view, 
it is of interest to consider the implications of applying invariance to the power 
functions rather than to the tests themselves. Any test that is invariant or almost 
invariant under a group G has a power function which is invariant under the group 
$\bar{G}$ induced by G in the parameter space. 

To see that the converse is in general not true, let $X_1, X_2, X_3$ be independently, 
normally distributed with mean $\xi$ and variance $\sigma^2$, and consider the hypothesis 
$\sigma \geq \sigma_0$. The test with rejection region 

$|X_2 - X_1| > k$ when $\bar{X} < 0$, 

$|X_3 - X_2| > k$ when $\bar{X} \geq 0$ 

is not invariant under the group G of transformations $X_i' = X_i + c$, but its power 
function is invariant under the associated group $\bar{G}$. 
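
The failure of invariance can be exhibited at a single sample point. The Python sketch below uses a hypothetical point and cutoff k = 1: the test does not reject at x but rejects at the translate x - 5. The `power` function rests on a standard fact not spelled out in the text: for normal samples the mean $\bar{X}$ is independent of the differences, and $X_2 - X_1$ and $X_3 - X_2$ are both $N(0, 2\sigma^2)$, so the rejection probability is $P(|D| > k)$ with $D \sim N(0, 2\sigma^2)$, which does not involve $\xi$.

```python
import math

def rejects(x, k):
    """Rejection rule from the text for x = (x1, x2, x3)."""
    x1, x2, x3 = x
    xbar = (x1 + x2 + x3) / 3.0
    if xbar < 0:
        return abs(x2 - x1) > k
    return abs(x3 - x2) > k

def power(sigma, k):
    """P(|D| > k) for D ~ N(0, 2 sigma^2); free of the mean xi."""
    return math.erfc(k / (2.0 * sigma))

k = 1.0
x = (0.0, 3.0, 3.0)                       # hypothetical sample point
x_shifted = tuple(xi - 5.0 for xi in x)   # translation with c = -5

decision_before = rejects(x, k)           # mean >= 0 and |x3 - x2| = 0
decision_after = rejects(x_shifted, k)    # mean < 0 and |x2 - x1| = 3
```

The two decisions differ, so the test is not invariant under translations, while `power` depends on the parameters only through $\sigma$.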

The two properties, almost invariance of a test <p and invariance of its power 
function, become equivalent if before the application of invariance considerations 
the problem is reduced to a sufficient statistic whose distributions constitute a 
boundedly complete family. 

Lemma 6.5.1 Let the family $\mathcal{P}^T = \{P_\theta^T, \theta \in \Omega\}$ of distributions of T be 
boundedly complete, and let the problem of testing $H : \theta \in \Omega_H$ remain invariant under 
a group G of transformations of T. Then a necessary and sufficient condition for 
the power function of a test $\psi(t)$ to be invariant under the induced group $\bar{G}$ over 
$\Omega$ is that $\psi(t)$ is almost invariant under G. 

Proof. For all $\theta \in \Omega$ we have $E_{\bar{g}\theta}\psi(T) = E_\theta\psi(gT)$. If $\psi$ is almost invariant, 
$E_\theta\psi(T) = E_\theta\psi(gT)$ and hence $E_{\bar{g}\theta}\psi(T) = E_\theta\psi(T)$, so that the power function 
of $\psi$ is invariant. Conversely, if $E_\theta\psi(T) = E_{\bar{g}\theta}\psi(T)$, then $E_\theta\psi(T) = E_\theta\psi(gT)$, 
and by the bounded completeness of $\mathcal{P}^T$, we have $\psi(gt) = \psi(t)$ a.e. $\mathcal{P}^T$. ■ 

As a consequence, it is seen that UMP almost invariant tests also possess the 
following optimum property. 

Theorem 6.5.2 Under the assumptions of Lemma 6.5.1, let $v(\theta)$ be maximal 
invariant with respect to $\bar{G}$, and suppose that among the tests of H based on the 
sufficient statistic T there exists a UMP almost invariant one, say $\psi_0(t)$. Then 
$\psi_0(t)$ is UMP in the class of all tests based on the original observations X, whose 
power function depends only on $v(\theta)$. 

Proof. Let $\phi(x)$ be any such test, and let $\psi(t) = E[\phi(X) \mid t]$. The power function 
of $\psi(t)$, being identical with that of $\phi(x)$, depends then only on $v(\theta)$, and hence 
is invariant under $\bar{G}$. It follows from Lemma 6.5.1 that $\psi(t)$ is almost invariant 
under G, and $\psi_0(t)$ is uniformly at least as powerful as $\psi(t)$ and therefore as 
$\phi(x)$. ■ 

Example 6.5.3 For the hypothesis $\tau^2 \leq \sigma^2$ concerning the variances of two 
normal distributions, the statistics $(\bar{X}, \bar{Y}, S_X^2, S_Y^2)$ constitute a complete set of 
sufficient statistics. It was shown in Example 6.3.4 that there exists a UMP 
invariant test with respect to a suitable group G, which has rejection region 
$S_Y^2/S_X^2 > C_0$. Since in the present case almost invariance of a test with respect 
to G implies that it is equivalent to an invariant one (Problem 6.21), Theorem 
6.5.2 is applicable with $v(\theta) = \Delta = \tau^2/\sigma^2$, and the test is therefore UMP among 
all tests whose power function depends only on $\Delta$. ■ 

Theorem 6.5.1 makes it possible to establish a simple condition under which 
reduction to sufficiency before the application of invariance is legitimate. 
Theorem 6.5.3 Let X be distributed according to $P_\theta$, $\theta \in \Omega$, and let T be 
sufficient for $\theta$. Suppose G leaves invariant the problem of testing $H : \theta \in \Omega_H$, and 
that T satisfies 

$T(x_1) = T(x_2)$ implies $T(gx_1) = T(gx_2)$ for all $g \in G$, 

so that G induces a group $\tilde{G}$ of transformations of T-space through 

$\tilde{g}T(x) = T(gx)$. 

(i) If $\phi(x)$ is any invariant test of H, there exists an almost invariant test $\psi$ 
based on T, which has the same power function as $\phi$. 

(ii) If in addition the assumptions of Theorem 6.5.1 are satisfied, the test $\psi$ 
of (i) can be taken to be invariant. 

(iii) If there exists a test $\psi_0(T)$ which is UMP among all $\tilde{G}$-invariant tests 
based on T, then under the assumptions of (ii), $\psi_0$ is also UMP among all 
G-invariant tests based on X. 

This theorem justifies the derivation of the UMP invariant tests of Examples 
6.3.3 and 6.3.4. 

Proof. (i): Let $\psi(t) = E[\phi(X) \mid t]$. Then $\psi$ has the same power function as $\phi$. To 
complete the proof, it suffices to show that $\psi(t)$ is almost invariant, i.e. that 

$\psi(\tilde{g}t) = \psi(t)$ (a.e. $\mathcal{P}^T$). 

It follows from (1) that 

$E_\theta[\phi(gX) \mid \tilde{g}t] = E_{\bar{g}\theta}[\phi(X) \mid t]$ (a.e. $P_\theta$). 

Since T is sufficient, both sides of this equation are independent of $\theta$. Furthermore 
$\phi(gx) = \phi(x)$ for all x and g, and this completes the proof. ■ 

Part (ii) follows immediately from (i) and Theorem 6.5.1, and part (iii) from 
(ii). 


6.6 Unbiasedness and Invariance 

The principles of unbiasedness and invariance complement each other in that each
is successful in cases where the other is not. For example, there exist UMP
unbiased tests for the comparison of two binomial or Poisson distributions, problems
to which invariance considerations are not applicable. UMP unbiased tests also
exist for testing the hypothesis $\sigma = \sigma_0$ against $\sigma \ne \sigma_0$ in a normal distribution,
while invariance does not reduce this problem sufficiently far. Conversely, there
exist UMP invariant tests of hypotheses specifying the values of more than one
parameter (to be considered in Chapter 7) but for which the class of unbiased
tests has no UMP member. There are also hypotheses, for example the one-sided
hypothesis $\xi/\sigma \le \theta_0$ in a univariate normal distribution or $\rho \le \rho_0$ in a bivariate
one (Problem 6.19) with $\theta_0, \rho_0 \ne 0$, where a UMP invariant test exists but the
existence of a UMP unbiased test does not follow by the methods of Chapter 5
and is an open question.

On the other hand, to some problems both principles have been applied
successfully. These include Student's hypotheses $\xi \le \xi_0$ and $\xi = \xi_0$ concerning the mean
of a normal distribution, and the corresponding two sample problems $\eta - \xi \le \Delta_0$
and $\eta - \xi = \Delta_0$ when the variances of the two samples are assumed equal. Other
examples are the one-sided hypotheses $\sigma^2 \ge \sigma_0^2$ and $\tau^2/\sigma^2 \ge \Delta_0$ concerning the
variances of one or two normal distributions. The hypothesis of independence
$\rho = 0$ in a bivariate normal distribution is still another case in point (Problem
6.19). In all these examples the two optimum procedures coincide. We shall now
show that this is not accidental but is the case whenever the UMP invariant
test is UMP also among all almost invariant tests and the UMP unbiased test is
unique. In this sense, the principles of unbiasedness and of almost invariance are
consistent.

Theorem 6.6.1 Suppose that for a given testing problem there exists a UMP
unbiased test $\phi^*$ which is unique (up to sets of measure zero), and that there also
exists a UMP almost invariant test with respect to some group $G$. Then the latter
is also unique (up to sets of measure zero), and the two tests coincide a.e.

Proof. If $U(\alpha)$ is the class of unbiased level-$\alpha$ tests, and if $g \in G$, then $\phi \in U(\alpha)$
if and only if $\phi g \in U(\alpha)$.⁴ Denoting the power function of the test $\phi$ by $\beta_\phi(\theta)$,
we thus have

$$\beta_{\phi^* g}(\theta) = \beta_{\phi^*}(\bar{g}\theta) = \sup_{\phi \in U(\alpha)} \beta_\phi(\bar{g}\theta) = \sup_{\phi \in U(\alpha)} \beta_{\phi g}(\theta) = \sup_{\phi g \in U(\alpha)} \beta_{\phi g}(\theta) = \beta_{\phi^*}(\theta).$$

It follows that $\phi^*$ and $\phi^* g$ have the same power function, and, because of
the uniqueness assumption, that $\phi^*$ is almost invariant. Therefore, if $\phi'$ is UMP
almost invariant, we have $\beta_{\phi'}(\theta) \ge \beta_{\phi^*}(\theta)$ for all $\theta$. On the other hand, $\phi'$ is
unbiased, as is seen by comparing it with the invariant test $\phi(x) \equiv \alpha$, and hence
$\beta_{\phi'}(\theta) \le \beta_{\phi^*}(\theta)$ for all $\theta$. Since $\phi'$ and $\phi^*$ therefore have the same power function,
they are equal a.e. because of the uniqueness of $\phi^*$, as was to be proved. ■

This theorem provides an alternative derivation for some of the tests of Chapter 
5. In Theorem 4.4.1, the existence of UMP unbiased tests was established for one- 
and two-sided hypotheses concerning the parameter $\theta$ of the exponential family
(4.10). For this family, the statistics $(U, T)$ are sufficient and complete, and in
terms of these statistics the UMP unbiased test is therefore unique. Convenient 
explicit expressions for some of these tests, which were derived in Chapter 5, can 
instead be obtained by noting that when a UMP almost invariant test exists, the 
same test by Theorem 6.6.1 must also be UMP unbiased. This proves for example 
that the tests of Examples 6.3.3 and 6.3.4 are UMP unbiased. 

The principles of unbiasedness and invariance can be used to supplement each
other in cases where neither principle alone leads to a solution but where they
do so when applied in conjunction. As an example consider a sample $X_1, \ldots, X_n$
from $N(\xi, \sigma^2)$ and the problem of testing $H : \xi/\sigma = \theta_0 \ne 0$ against the two-sided
alternatives that $\xi/\sigma \ne \theta_0$. Here sufficiency and invariance reduce the problem
to the consideration of $t = \sqrt{n}\,\bar{x} / \sqrt{\sum (x_i - \bar{x})^2/(n-1)}$. The distribution of this
statistic is the noncentral $t$-distribution with noncentrality parameter $\delta = \sqrt{n}\,\xi/\sigma$
and $n - 1$ degrees of freedom. For varying $\delta$, the family of these distributions can
be shown to be STP$_\infty$ [Karlin (1968, pp. 118-119); see Problem 3.50] and hence
in particular STP$_3$. It follows by Problem 6.42 that among all tests of $H$ based on
$t$, there exists a UMP unbiased one with acceptance region $C_1 \le t \le C_2$, where
$C_1, C_2$ are determined by the conditions

$$P_{\delta_0}\{C_1 \le t \le C_2\} = 1 - \alpha \quad\text{and}\quad \left.\frac{d\,P_\delta\{C_1 \le t \le C_2\}}{d\delta}\right|_{\delta = \delta_0} = 0.$$

In terms of the original observations, this test then has the property of being
UMP among all tests that are unbiased and invariant. Whether it is also UMP
unbiased without the restriction to invariant tests is an open problem.

⁴ $\phi g$ denotes the critical function which assigns to $x$ the value $\phi(gx)$.

An analogous example occurs in the testing of the hypotheses $H : \rho = \rho_0$
and $H' : \rho_1 \le \rho \le \rho_2$ against two-sided alternatives on the basis of a sample
from a bivariate normal distribution with correlation coefficient $\rho$. (The testing
of $\rho \le \rho_0$ against $\rho > \rho_0$ is treated in Problem 6.19.) The distribution of the
sample correlation coefficient has not only monotone likelihood ratio as shown in
Problem 6.19, but is in fact STP$_\infty$ [Karlin (1968, Section 3.4)]. Hence there exist
tests of both $H$ and $H'$ which are UMP among all tests that are both invariant
and unbiased.

Another case in which the combination of invariance and unbiasedness appears
to offer a promising approach is the Behrens-Fisher problem. Let $X_1, \ldots, X_m$
and $Y_1, \ldots, Y_n$ be samples from normal distributions $N(\xi, \sigma^2)$ and $N(\eta, \tau^2)$
respectively. The problem is that of testing $H : \eta \le \xi$ (or $\eta = \xi$) without
assuming equality of the variances $\sigma^2$ and $\tau^2$. A set of sufficient statistics
for $(\xi, \eta, \sigma, \tau)$ is then $(\bar{X}, \bar{Y}, S_X^2, S_Y^2)$, where $S_X^2 = \sum (X_i - \bar{X})^2/(m - 1)$ and
$S_Y^2 = \sum (Y_j - \bar{Y})^2/(n - 1)$. Adding the same constant to $\bar{X}$ and $\bar{Y}$ reduces the
problem to $\bar{Y} - \bar{X}$, $S_X^2$, $S_Y^2$, and multiplication of all variables by a common
positive constant to $(\bar{Y} - \bar{X})/\sqrt{S_X^2 + S_Y^2}$ and $S_Y^2/S_X^2$. One would expect any
reasonable invariant rejection region to be of the form

$$\frac{\bar{Y} - \bar{X}}{\sqrt{S_X^2 + S_Y^2}} \ge g\!\left(\frac{S_Y^2}{S_X^2}\right) \tag{6.17}$$


for some suitable function $g$. If this test is also to be unbiased, the probability
of (6.17) must equal $\alpha$ when $\eta = \xi$ for all values of $\tau/\sigma$. It has been shown
by Linnik and others that only pathological functions $g$ with this property can
exist. [This work is reviewed by Pfanzagl (1974).] However, approximate solutions
are available which provide tests that are satisfactory for all practical purposes.
These are the Welch approximate $t$-solution described in Section 11.3, and the
Welch-Aspin test. Both are discussed, and evaluated, in Scheffé (1970) and Wang
(1971); see also Chernoff (1949), Wallace (1958), Davenport and Webster (1975)
and Robinson (1982). The Behrens-Fisher problem will be revisited in Examples
13.5.4 and 15.6.3 and Section 15.2.
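
The invariance reduction described for the Behrens-Fisher problem can be checked numerically. The following Python sketch is only an illustration: the function name `behrens_fisher_invariants` and the sample values are assumptions introduced here, not part of the text. It computes the two statistics to which the problem is reduced, $(\bar{Y} - \bar{X})/\sqrt{S_X^2 + S_Y^2}$ and $S_Y^2/S_X^2$, and evaluates them again after all observations have been multiplied by a common positive constant and shifted by a common constant.

```python
from statistics import mean, variance


def behrens_fisher_invariants(xs, ys):
    """Return ((Ybar - Xbar) / sqrt(SX2 + SY2), SY2 / SX2).

    SX2 and SY2 are the sample variances with divisors m - 1 and n - 1.
    """
    sx2 = variance(xs)
    sy2 = variance(ys)
    return (mean(ys) - mean(xs)) / (sx2 + sy2) ** 0.5, sy2 / sx2


# Hypothetical data, chosen only for illustration.
xs = [1.2, 0.4, 2.5, 1.9]
ys = [2.8, 3.1, 1.7, 4.0, 2.2]

scale, shift = 3.0, 10.0  # common positive scale and common shift
before = behrens_fisher_invariants(xs, ys)
after = behrens_fisher_invariants(
    [scale * x + shift for x in xs],
    [scale * y + shift for y in ys],
)
```

Up to floating-point rounding, `before` and `after` agree, which is the invariance property the reduction relies on.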


The property of a test $\phi_1$ being UMP invariant is relative to a particular group
$G_1$, and does not exclude the possibility that there might exist another test $\phi_2$
which is UMP invariant with respect to a different group $G_2$. Simple instances
can be obtained from Examples 6.5.1 and 6.6.11.


Example 6.6.8 (continued) If $G_1$ is the group $G$ of Example 6.5.1, a UMP
invariant test of $H : \theta \le \theta_0$ against $\theta > \theta_0$ rejects when $Y_1 + \cdots + Y_n > C$.
Let $G_2$ be the group obtained by interchanging the role of $Z$ and $Y_1$. Then a
UMP invariant test with respect to $G_2$ rejects when $Z + Y_2 + \cdots + Y_n > C$.
Analogous UMP invariant tests are obtained by interchanging the role of $Z$ and
any one of the other $Y$'s and further examples by applying the transformations
of $G$ in Example 6.5.1 to more than one variable. In particular, if it is applied
independently to all $n + 1$ variables, only the constants remain invariant, and the
test $\phi \equiv \alpha$ is UMP invariant. ■


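Example 6.6.11 For another example (due to Charles Stein), let $(X_{11}, X_{12})$
and $(X_{21}, X_{22})$ be independent and have bivariate normal distributions with zero
means and covariance matrices

$$\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} \Delta\sigma_1^2 & \Delta\rho\sigma_1\sigma_2 \\ \Delta\rho\sigma_1\sigma_2 & \Delta\sigma_2^2 \end{pmatrix}.$$

Suppose that these matrices are nonsingular, or equivalently that $|\rho| \ne 1$, but that
all $\sigma_1$, $\sigma_2$, $\rho$, and $\Delta$ are otherwise unknown. The problem of testing $\Delta = 1$ against
$\Delta > 1$ remains invariant under the group $G_1$ of all nonsingular transformations

$$X'_{i1} = bX_{i1}, \qquad X'_{i2} = a_1 X_{i1} + a_2 X_{i2} \qquad (a_2, b > 0).$$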


Since the probability is 0 that $X_{11}X_{22} = X_{12}X_{21}$, the $2 \times 2$ matrix $(X_{ij})$ is
nonsingular with probability 1, and the sample space can therefore be restricted
to be the set of all nonsingular such matrices. A maximal invariant under the
subgroup corresponding to $b = 1$ is the pair $(X_{11}, X_{21})$. The argument of Example
6.3.4 then shows that there exists a UMP invariant test under $G_1$ which rejects
when $X_{21}^2/X_{11}^2 > C$.

By interchanging 1 and 2 in the second subscript of the $X$'s one sees that under
the corresponding group $G_2$ the UMP invariant test rejects when $X_{22}^2/X_{12}^2 > C$.

A third group leaving the problem invariant is the smallest group containing
both $G_1$ and $G_2$, namely the group $G$ of all common nonsingular transformations

$$X'_{i1} = a_{11}X_{i1} + a_{12}X_{i2}, \qquad X'_{i2} = a_{21}X_{i1} + a_{22}X_{i2}.$$

Given any two nonsingular sample points $Z = (X_{ij})$ and $Z' = (X'_{ij})$, there exists
a nonsingular linear transformation $A$ such that $Z' = AZ$. There are therefore
no invariants under $G$, and the only invariant size-$\alpha$ test is $\phi \equiv \alpha$. It follows
vacuously that this is UMP invariant under $G$. ■
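
The transitivity claim in Example 6.6.11 (any nonsingular sample point can be carried into any other) can be made concrete. The sketch below takes the relation $Z' = AZ$ as written in the text and simply computes $A = Z'Z^{-1}$; the helper names and the two matrices are hypothetical choices made for illustration.

```python
def inverse_2x2(m):
    """Inverse of a nonsingular 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(p, q):
    """Product of two 2 x 2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def connecting_transformation(z, z_prime):
    """Return A with A Z = Z' for nonsingular Z and Z'."""
    return matmul_2x2(z_prime, inverse_2x2(z))


# Two hypothetical nonsingular sample points.
Z = [[1.0, 2.0], [3.0, 5.0]]
Z_prime = [[0.5, -1.0], [4.0, 2.0]]

A = connecting_transformation(Z, Z_prime)
AZ = matmul_2x2(A, Z)
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
```

Since $A$ is a product of nonsingular matrices it is itself nonsingular, so every nonsingular sample point lies in a single orbit and no nonconstant invariant can exist.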


6.7 Admissibility 

Any UMP unbiased test has the important property of admissibility (Problem 
4.1), in the sense that there cannot exist another test which is uniformly at least 
as powerful and against some alternatives actually more powerful than the given 
one. The corresponding property does not necessarily hold for UMP invariant 
tests, as is shown by the following example. 

Example 6.7.11 (continued) Under the assumptions of Example 6.6.11 it was
seen that the UMP invariant test under $G$ is the test $\varphi \equiv \alpha$ which has power
$\beta(\Delta) \equiv \alpha$. On the other hand, $X_{11}$ and $X_{21}$ are independently distributed as
$N(0, \sigma_1^2)$ and $N(0, \Delta\sigma_1^2)$. On the basis of these observations there exists a UMP
test for testing $\Delta = 1$ against $\Delta > 1$ with rejection region $X_{21}^2/X_{11}^2 > C$ (Problem
3.62). The power function of this test is strictly increasing in $\Delta$ and hence $> \alpha$
for all $\Delta > 1$. ■
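
The power function in this example has a simple closed form, which the following sketch evaluates. It relies on a standard fact not stated in the text: the ratio of two independent centered normal variables with equal variances is standard Cauchy, so that $X_{21}/X_{11}$ is $\sqrt{\Delta}$ times a standard Cauchy variable. The function names are illustrative assumptions.

```python
import math


def critical_value(alpha):
    """C with P(X21^2 / X11^2 > C) = alpha when Delta = 1.

    For a standard Cauchy variable W, P(|W| > w) = 1 - (2/pi) * atan(w).
    """
    return math.tan(math.pi * (1.0 - alpha) / 2.0) ** 2


def power(delta, alpha):
    """P(X21^2 / X11^2 > C) when Var(X21) / Var(X11) = delta."""
    c = critical_value(alpha)
    return 1.0 - (2.0 / math.pi) * math.atan(math.sqrt(c / delta))


alpha = 0.05
powers = [power(delta, alpha) for delta in (1.0, 2.0, 4.0, 16.0, 100.0)]
```

The values in `powers` start at $\alpha$ for $\Delta = 1$ and increase with $\Delta$, in contrast to the constant power $\alpha$ of the UMP invariant test $\varphi \equiv \alpha$ under $G$.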

Admissibility of optimum invariant tests therefore cannot be taken for granted 
but must be established separately for each case. 

We shall distinguish two slightly different concepts of admissibility. A test $\varphi_0$
will be called $\alpha$-admissible for testing $H : \theta \in \Omega_H$ against a class of alternatives
$\theta \in \Omega'$ if for any other level-$\alpha$ test $\varphi$

$$E_\theta\varphi(X) \ge E_\theta\varphi_0(X) \quad\text{for all } \theta \in \Omega' \tag{6.18}$$

implies $E_\theta\varphi(X) = E_\theta\varphi_0(X)$ for all $\theta \in \Omega'$. This definition takes no account of
the relationship of $E_\theta\varphi(X)$ and $E_\theta\varphi_0(X)$ for $\theta \in \Omega_H$ beyond the requirement
that both tests are of level $\alpha$. For some unexpected, and possibly undesirable
consequences of $\alpha$-admissibility, see Perlman and Wu (1999). A concept closer to
the decision-theoretic notion of admissibility discussed in Section 1.8, defines $\varphi_0$
to be $d$-admissible for testing $H$ against $\Omega'$ if (6.18) and

$$E_\theta\varphi(X) \le E_\theta\varphi_0(X) \quad\text{for all } \theta \in \Omega_H \tag{6.19}$$

jointly imply $E_\theta\varphi(X) = E_\theta\varphi_0(X)$ for all $\theta \in \Omega_H \cup \Omega'$ (see Problem 6.32).

Any level-$\alpha$ test $\varphi_0$ that is $\alpha$-admissible is also $d$-admissible provided no other
test $\varphi$ exists with $E_\theta\varphi(X) = E_\theta\varphi_0(X)$ for all $\theta \in \Omega'$ but $E_\theta\varphi(X) \ne E_\theta\varphi_0(X)$
for some $\theta \in \Omega_H$. That the converse does not hold is shown by the following
example.

Example 6.7.12 Let $X$ be normally distributed with mean $\xi$ and known
variance $\sigma^2$. For testing $H : \xi \le -1$ or $\ge 1$ against $\Omega' : \xi = 0$, there exists a level-$\alpha$
test $\varphi_0$, which rejects when $C_1 < X < C_2$ and accepts otherwise, such that
(Problem 6.33)

$$E_\xi\varphi_0(X) \le E_{\xi=-1}\varphi_0(X) = \alpha \quad\text{for } \xi \le -1$$

and

$$E_\xi\varphi_0(X) \le E_{\xi=+1}\varphi_0(X) = \alpha' < \alpha \quad\text{for } \xi \ge +1.$$

A slight modification of the proof of Theorem 3.7.1 shows that $\varphi_0$ is the unique
test maximizing the power at $\xi = 0$ subject to

$$E_\xi\varphi(X) \le \alpha \ \text{ for } \xi \le -1 \quad\text{and}\quad E_\xi\varphi(X) \le \alpha' \ \text{ for } \xi \ge 1,$$

and hence that $\varphi_0$ is $d$-admissible.

On the other hand, the test $\varphi$ with rejection region $|X| < C$, where
$E_{\xi=-1}\varphi(X) = E_{\xi=1}\varphi(X) = \alpha$, is the unique test maximizing the power at $\xi = 0$
subject to $E_\xi\varphi(X) \le \alpha$ for $\xi \le -1$ or $\ge 1$, and hence is more powerful against
$\Omega'$ than $\varphi_0$, so that $\varphi_0$ is not $\alpha$-admissible. ■
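
A numerical illustration of Example 6.7.12 may help fix ideas. The sketch below assumes $\sigma = 1$ and $\alpha = 0.05$, and picks a hypothetical left endpoint $C_1 = -0.5$ for an asymmetric interval test; none of these values come from the text. It compares the power at $\xi = 0$ of that test with the power of the symmetric test $|X| < C$.

```python
from statistics import NormalDist

STD = NormalDist()  # X - xi is standard normal when sigma = 1 (assumed here)


def rejection_probability(xi, c1, c2):
    """P_xi(c1 < X < c2) for X ~ N(xi, 1)."""
    return STD.cdf(c2 - xi) - STD.cdf(c1 - xi)


def solve_increasing(f, lo, hi, target, tol=1e-12):
    """Bisection for f(c) = target, with f increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha = 0.05

# Symmetric test: reject when |X| < C, rejection probability alpha at xi = -1 and +1.
C = solve_increasing(lambda c: rejection_probability(1.0, -c, c), 0.0, 5.0, alpha)

# Asymmetric interval test: hypothetical C1, with C2 chosen so that the
# rejection probability at xi = -1 equals alpha.
C1 = -0.5
C2 = solve_increasing(lambda c: rejection_probability(-1.0, C1, c), C1, 5.0, alpha)

alpha_prime = rejection_probability(1.0, C1, C2)       # level attained at xi = +1
power_asymmetric = rejection_probability(0.0, C1, C2)  # power at xi = 0
power_symmetric = rejection_probability(0.0, -C, C)    # power at xi = 0
```

With these choices `alpha_prime` is smaller than `alpha`, and `power_symmetric` exceeds `power_asymmetric`: the symmetric test uses the full level $\alpha$ on both sides of the hypothesis and gains power at $\xi = 0$, which is the phenomenon behind the failure of $\alpha$-admissibility.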

A test that is admissible under either definition against $\Omega'$ is also admissible
against any $\Omega''$ containing $\Omega'$ and hence in particular against the class of all
alternatives $\Omega_K = \Omega - \Omega_H$. The terms $\alpha$- and $d$-admissible without qualification
will be reserved for admissibility against $\Omega_K$. Unless a UMP test exists, any
$\alpha$-admissible test will be admissible against some $\Omega' \subset \Omega_K$ and inadmissible against
others. Both the strength of an admissibility result and the method of proof will
depend on the set $\Omega'$.

Consider in particular the admissibility of a UMP unbiased test mentioned at 
the beginning of the section. This does not rule out the existence of a test with 
greater power for all alternatives of practical importance and smaller power only 
for alternatives so close to H that the value of the power there is immaterial. 
In the present section, we shall discuss two methods for proving admissibility 
against various classes of alternatives. 


Theorem 6.7.1 Let $X$ be distributed according to an exponential family with
density

$$p_\theta(x) = C(\theta) \exp\left( \sum_{j=1}^{s} \theta_j T_j(x) \right)$$

with respect to a $\sigma$-finite measure $\mu$ over a Euclidean sample space $(\mathcal{X}, \mathcal{A})$, and
let $\Omega$ be the natural parameter space of this family. Let $\Omega_H$ and $\Omega'$ be disjoint
nonempty subsets of $\Omega$, and suppose that $\varphi_0$ is a test of $H : \theta \in \Omega_H$ based on
$T = (T_1, \ldots, T_s)$ with acceptance region $A_0$ which is a closed convex subset of $R^s$
possessing the following property: If $A_0 \cap \{\sum a_i t_i > c\}$ is empty for some $c$, there
exists a point $\theta^* \in \Omega$ and a sequence $\lambda_n \to \infty$ such that $\theta^* + \lambda_n a \in \Omega'$ [where $\lambda_n$
is a scalar and $a = (a_1, \ldots, a_s)$]. Then if $A$ is any other acceptance region for $H$
satisfying

$$P_\theta(X \in A) \le P_\theta(X \in A_0) \quad\text{for all } \theta \in \Omega',$$

$A$ is contained in $A_0$, except for a subset of measure 0, i.e. $\mu(A \cap \tilde{A}_0) = 0$, where
$\tilde{A}_0$ denotes the complement of $A_0$.


Proof. Suppose to the contrary that $\mu(A \cap \tilde{A}_0) > 0$. Then it follows from the
closure and convexity of $A_0$, that there exist $a \in R^s$ and a real number $c$ such
that

$$A_0 \cap \left\{ t : \sum a_i t_i > c \right\} \text{ is empty} \tag{6.20}$$

and

$$A \cap \left\{ t : \sum a_i t_i > c \right\} \text{ has positive } \mu\text{-measure}, \tag{6.21}$$

that is, the set $A$ protrudes in some direction from the convex set $A_0$. We shall
show that this fact and the exponential nature of the densities imply that

$$P_\theta(A) > P_\theta(A_0) \quad\text{for some } \theta \in \Omega', \tag{6.22}$$

which provides the required contradiction. Let $\varphi_0$ and $\varphi$ denote the indicators of
$\tilde{A}_0$ and $\tilde{A}$ respectively, so that (6.22) is equivalent to

$$\int [\varphi_0(t) - \varphi(t)]\, dP_\theta(t) > 0 \quad\text{for some } \theta \in \Omega'.$$

If $\theta = \theta^* + \lambda_n a \in \Omega'$, the left side becomes

$$\frac{C(\theta^* + \lambda_n a)}{C(\theta^*)}\, e^{c\lambda_n} \int [\varphi_0(t) - \varphi(t)]\, e^{\lambda_n(\sum a_i t_i - c)}\, dP_{\theta^*}(t).$$

Let this integral be $I_n^+ + I_n^-$, where $I_n^+$ and $I_n^-$ denote the contributions over the
regions of integration $\{t : \sum a_i t_i > c\}$ and $\{t : \sum a_i t_i \le c\}$ respectively. Since $I_n^-$
is bounded, it is enough to show that $I_n^+ \to \infty$ as $n \to \infty$. By (6.20), $\varphi_0(t) = 1$
and hence $\varphi_0(t) - \varphi(t) \ge 0$ when $\sum a_i t_i > c$, and by (6.21)

$$\mu\left\{ \varphi_0(t) - \varphi(t) > 0 \text{ and } \sum a_i t_i > c \right\} > 0.$$

This shows that $I_n^+ \to \infty$ as $\lambda_n \to \infty$ and therefore completes the proof. ■

Corollary 6.7.1 Under the assumptions of Theorem 6.7.1, the test with
acceptance region $A_0$ is $d$-admissible. If its size is $\alpha$ and there exists a finite point $\theta_0$
in the closure $\bar{\Omega}_H$ of $\Omega_H$ for which $E_{\theta_0}\varphi_0(X) = \alpha$, then $\varphi_0$ is also $\alpha$-admissible.

Proof.

(i) Suppose $\varphi$ satisfies (6.18). Then by Theorem 6.7.1, $\varphi_0(x) \le \varphi(x)$ (a.e. $\mu$). If
$\varphi_0(x) < \varphi(x)$ on a set of positive measure, then $E_\theta\varphi_0(X) < E_\theta\varphi(X)$ for all
$\theta$ and hence (6.19) cannot hold.

(ii) By the argument of part (i), (6.18) implies $\alpha = E_{\theta_0}\varphi_0(X) < E_{\theta_0}\varphi(X)$, and
hence by the continuity of $E_\theta\varphi(X)$ there exists a point $\theta \in \Omega_H$ for which
$\alpha < E_\theta\varphi(X)$. Thus $\varphi$ is not a level-$\alpha$ test. ■

Theorem 6.7.1 and the corollary easily extend to the case where the
competitors $\varphi$ of $\varphi_0$ are permitted to be randomized but the assumption that $\varphi_0$
is nonrandomized is essential. Thus, the main applications of these results are
to the case that $\mu$ is absolutely continuous with respect to Lebesgue measure.
The boundary of $A_0$ will then typically have measure zero, so that the closure
requirement for $A_0$ can be dropped.

Example 6.7.13 (Normal mean) If $X_1, \ldots, X_n$ is a sample from the normal
distribution $N(\xi, \sigma^2)$, the family of distributions is exponential with $T_1 = \bar{X}$,
$T_2 = \sum X_i^2$, $\theta_1 = n\xi/\sigma^2$, $\theta_2 = -1/2\sigma^2$. Consider first the one-sided problem
$H : \theta_1 \le 0$, $K : \theta_1 > 0$ with $\alpha < \frac{1}{2}$. Then the acceptance region of the $t$-test is
$A : T_1/\sqrt{T_2} \le C$ ($C > 0$), which is convex [Problem 6.34(i)]. The alternatives
$\theta \in \Omega' \subset K$ will satisfy the conditions of Theorem 6.7.1 if for any half plane
$a_1 t_1 + a_2 t_2 > c$ that does not intersect the set $t_1 \le C\sqrt{t_2}$ there exists a ray
$(\theta_1^* + \lambda a_1, \theta_2^* + \lambda a_2)$ in the direction of the vector $(a_1, a_2)$ for which
$(\theta_1^* + \lambda a_1, \theta_2^* + \lambda a_2) \in \Omega'$ for all sufficiently large $\lambda$. In the present case, this
condition must hold for all $a_1 > 0 > a_2$. Examples of sets $\Omega'$ satisfying this
requirement (and against which the $t$-test is therefore admissible) are

$$\Omega'_1 : \theta_1 > k_1 \quad\text{or equivalently}\quad \frac{\xi}{\sigma^2} > k'_1$$

and

$$\Omega'_2 : \frac{\theta_1}{\sqrt{-\theta_2}} > k_2 \quad\text{or equivalently}\quad \frac{\xi}{\sigma} > k'_2.$$

On the other hand, the condition is not satisfied for $\Omega' : \xi > k$ (Problem 6.34).

Analogously, the acceptance region $A : T_1^2 \le CT_2$ of the two-sided $t$-test for
testing $H : \theta_1 = 0$ against $\theta_1 \ne 0$ is convex, and the test is admissible against
$\Omega'_1 : |\xi/\sigma^2| > k_1$ and $\Omega'_2 : |\xi/\sigma| > k_2$. ■

In decision theory, a quite general method for proving admissibility consists
in exhibiting a procedure as a unique Bayes solution. In the present case, this is
justified by the following result, which is closely related to Theorem 3.8.1.


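Theorem 6.7.2 Suppose the set $\{x : f_\theta(x) > 0\}$ is independent of $\theta$, and let a
$\sigma$-field be defined over the parameter space $\Omega$, containing both $\Omega_H$ and $\Omega_K$ and
such that the densities $f_\theta(x)$ (with respect to $\mu$) of $X$ are jointly measurable in $\theta$
and $x$. Let $\Lambda_0$ and $\Lambda_1$ be probability distributions over this $\sigma$-field with
$\Lambda_0(\Omega_H) = \Lambda_1(\Omega_K) = 1$, and let

$$h_i(x) = \int f_\theta(x)\, d\Lambda_i(\theta).$$

Suppose $\varphi_0$ is a nonrandomized test of $H$ against $K$ defined by

$$\varphi_0(x) = \begin{cases} 1 & \text{if } h_1(x)/h_0(x) > k, \\ 0 & \text{if } h_1(x)/h_0(x) < k, \end{cases}$$

and that $\mu\{x : h_1(x)/h_0(x) = k\} = 0$.

(i) Then $\varphi_0$ is $d$-admissible for testing $H$ against $K$.

(ii) Let $\sup_{\Omega_H} E_\theta\varphi_0(X) = \alpha$ and $\omega = \{\theta : E_\theta\varphi_0(X) = \alpha\}$. If $\omega \subset \Omega_H$ and
$\Lambda_0(\omega) = 1$, then $\varphi_0$ is also $\alpha$-admissible.

(iii) If $\Lambda_1$ assigns probability 1 to $\Omega' \subset \Omega_K$, the conclusions of (i) and (ii)
apply with $\Omega'$ in place of $\Omega_K$.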

Proof. (i): Suppose $\varphi$ is any other test, satisfying (6.18) and (6.19) with
$\Omega' = \Omega_K$. Then also

$$\int E_\theta\varphi(X)\, d\Lambda_0(\theta) \le \int E_\theta\varphi_0(X)\, d\Lambda_0(\theta)$$

and

$$\int E_\theta\varphi(X)\, d\Lambda_1(\theta) \ge \int E_\theta\varphi_0(X)\, d\Lambda_1(\theta).$$

By the argument of Theorem 3.8.1, these inequalities are equivalent to

$$\int \varphi(x)h_0(x)\, d\mu(x) \le \int \varphi_0(x)h_0(x)\, d\mu(x)$$

and

$$\int \varphi(x)h_1(x)\, d\mu(x) \ge \int \varphi_0(x)h_1(x)\, d\mu(x),$$

and the $h_i(x)$ ($i = 0, 1$) are probability densities with respect to $\mu$. This
contradicts the uniqueness of the most powerful test of $h_0$ against $h_1$ at level
$\int \varphi(x)h_0(x)\, d\mu(x)$.

(ii): By assumption, $\int E_\theta\varphi_0(X)\, d\Lambda_0(\theta) = \alpha$, so that $\varphi_0$ is a level-$\alpha$ test of $h_0$.
If $\varphi$ is any other level-$\alpha$ test of $H$ satisfying (6.18) with $\Omega' = \Omega_K$, it is also a
level-$\alpha$ test of $h_0$ and the argument of part (i) can be applied as before.

(iii): This follows immediately from the proofs of (i) and (ii). ■
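
The step attributed to the argument of Theorem 3.8.1 amounts to an interchange of the order of integration. As a sketch, assuming (as in the statement of Theorem 6.7.2) that $f_\theta(x)$ is jointly measurable, so that Fubini's theorem applies to the nonnegative integrand, one has for any test $\varphi$ and $i = 0, 1$:

```latex
\int E_\theta \varphi(X)\, d\Lambda_i(\theta)
  = \int \left[ \int \varphi(x)\, f_\theta(x)\, d\mu(x) \right] d\Lambda_i(\theta)
  = \int \varphi(x) \left[ \int f_\theta(x)\, d\Lambda_i(\theta) \right] d\mu(x)
  = \int \varphi(x)\, h_i(x)\, d\mu(x).
```

Taking $\varphi \equiv 1$ in the same identity gives $\int h_i(x)\, d\mu(x) = 1$, which is why each $h_i$ is a probability density with respect to $\mu$.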


Example 6.7.13 (continued) In the two-sided normal problem of Example
6.7.13 with $H : \xi = 0$, $K : \xi \ne 0$ consider the class $\Omega'_{a,b}$ of alternatives $(\xi, \sigma)$
satisfying

$$\sigma^2 = \frac{1}{a + \eta^2}, \qquad \xi = \frac{b\eta}{a + \eta^2}, \qquad -\infty < \eta < \infty \tag{6.23}$$

for some fixed $a, b > 0$, and the subset $\omega$ of $\Omega_H$ of points $(0, \sigma^2)$ with $\sigma^2 < 1/a$.
Let $\Lambda_0, \Lambda_1$ be distributions over $\omega$ and $\Omega'_{a,b}$ defined by the densities [Problem
6.35(i)]

$$\lambda_0(\eta) = \frac{C_0}{(a + \eta^2)^{n/2}}$$

and

$$\lambda_1(\eta) = \frac{C_1\, e^{(n/2) b^2 \eta^2/(a + \eta^2)}}{(a + \eta^2)^{n/2}}.$$

Straightforward calculation then shows [Problem 6.35(ii)] that the densities $h_0$
and $h_1$ of Theorem 6.7.2 become

$$h_0(x) = \frac{C_0\, e^{-(a/2)\sum x_i^2}}{\sqrt{\sum x_i^2}}$$

and

$$h_1(x) = \frac{C_1 \exp\left\{ -\frac{a}{2}\sum x_i^2 + \frac{b^2 \left(\sum x_i\right)^2}{2\sum x_i^2} \right\}}{\sqrt{\sum x_i^2}},$$

so that the Bayes test $\varphi_0$ of Theorem 6.7.2 rejects when $\bar{x}^2/\sum x_i^2 > k$ and hence
reduces to the two-sided $t$-test.

The condition of part (ii) of the theorem is clearly satisfied so that the $t$-test
is both $d$- and $\alpha$-admissible against $\Omega'_{a,b}$.
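
That a rejection region of the form $\bar{x}^2/\sum x_i^2 > k$ is a two-sided $t$-test follows from the identity $t^2 = n(n-1)r/(1 - nr)$ with $r = \bar{x}^2/\sum x_i^2$, which is increasing in $r$ on $[0, 1/n)$. The Python sketch below checks this identity numerically; the function names and data are illustrative assumptions.

```python
import math


def t_statistic(xs):
    """One-sample t statistic sqrt(n) * xbar / s, with s^2 using divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * xbar / math.sqrt(s2)


def t_squared_from_ratio(r, n):
    """t^2 as a function of r = xbar^2 / sum(x_i^2); requires 0 <= r < 1/n."""
    return n * (n - 1) * r / (1.0 - n * r)


# Hypothetical sample, for illustration only.
xs = [0.3, -1.1, 2.4, 0.8, 1.5]
n = len(xs)
r = (sum(xs) / n) ** 2 / sum(x * x for x in xs)

t_squared_direct = t_statistic(xs) ** 2
t_squared_via_ratio = t_squared_from_ratio(r, n)
```

The two values agree up to rounding, so thresholding $r$ at $k$ is the same as thresholding $|t|$ at the corresponding critical value.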

When dealing with invariant tests, it is of particular interest to consider
admissibility against invariant classes of alternatives. In the case of the two-sided test
$\varphi_0$, this means sets $\Omega'$ depending only on $|\xi/\sigma|$. It was seen in Example 6.7.13
that $\varphi_0$ is admissible against $\Omega' : |\xi/\sigma| > B$ for any $B$, that is, against distant
alternatives, and it follows from the test being UMP unbiased or from Example
6.7.13 (continued) that $\varphi_0$ is admissible against $\Omega' : |\xi/\sigma| < A$ for any $A > 0$,
that is, against alternatives close to $H$. This leaves open the question whether
$\varphi_0$ is admissible against sets $\Omega' : 0 < A < |\xi/\sigma| < B < \infty$, which include
neither nearby nor distant alternatives. It was in fact shown by Lehmann and Stein
(1953) that $\varphi_0$ is admissible for testing $H$ against $|\xi|/\sigma = \delta$ for any $\delta > 0$ and
hence that it is admissible against any invariant $\Omega'$. It was also shown there that
the one-sided $t$-test of $H : \xi = 0$ is admissible against $\xi/\sigma = \delta'$ for any $\delta' > 0$.
These results will not be proved here. The proof is based on assigning to $\log \sigma$
the uniform density on $(-N, N)$ and letting $N \to \infty$, thereby approximating the
“improper” prior distribution which assigns to $\log \sigma$ the uniform distribution on
$(-\infty, \infty)$, that is, Lebesgue measure.

That the one-sided $t$-test $\varphi_1$ of $H : \xi \le 0$ is not admissible against all $\Omega'$ is
shown by Brown and Sackrowitz (1984), who exhibit a test $\varphi$ satisfying

$$E_{\xi,\sigma}\varphi(X) < E_{\xi,\sigma}\varphi_1(X) \quad\text{for all } \xi < 0,\ 0 < \sigma < \infty$$

and

$$E_{\xi,\sigma}\varphi(X) > E_{\xi,\sigma}\varphi_1(X) \quad\text{for all } 0 < \xi_1 < \xi < \xi_2 < \infty,\ 0 < \sigma < \infty. \ ■$$





Example 6.7.14 (Normal variance) For testing the variance $\sigma^2$ of a normal
distribution on the basis of a sample $X_1, \ldots, X_n$ from $N(\xi, \sigma^2)$, the Bayes
approach of Theorem 6.7.2 easily proves $\alpha$-admissibility of the standard test against
any location invariant set of alternatives $\Omega'$, that is, any set $\Omega'$ depending only
on $\sigma^2$. Consider first the one-sided hypothesis $H : \sigma \le \sigma_0$ and the alternatives
$\Omega' : \sigma = \sigma_1$ for any $\sigma_1 > \sigma_0$. Admissibility of the UMP invariant (and unbiased)
rejection region $\sum (X_i - \bar{X})^2 > C$ follows immediately from Section 3.9, where
it was shown that this test is Bayes for a pair of prior distributions $(\Lambda_0, \Lambda_1)$:
namely, $\Lambda_1$ assigning probability 1 to any point $(\xi_1, \sigma_1)$, and $\Lambda_0$ putting $\sigma = \sigma_0$
and assigning to $\xi$ the normal distribution $N(\xi_1, (\sigma_1^2 - \sigma_0^2)/n)$. Admissibility of
$\sum (X_i - \bar{X})^2 \le C$ when the hypothesis is $H : \sigma \ge \sigma_0$ and $\Omega' = \{(\xi, \sigma) : \sigma = \sigma_1\}$,
$\sigma_1 < \sigma_0$, is seen by interchanging $\Lambda_0$ and $\Lambda_1$, $\sigma_0$ and $\sigma_1$.

A similar approach proves $\alpha$-admissibility of any size-$\alpha$ rejection region

$$\sum (X_i - \bar{X})^2 \le C_1 \quad\text{or}\quad \ge C_2 \tag{6.24}$$

for testing $H : \sigma = \sigma_0$ against $\Omega' : \{\sigma = \sigma_1\} \cup \{\sigma = \sigma_2\}$ ($\sigma_1 < \sigma_0 < \sigma_2$). On
$\Omega_H$, where the only variable is $\xi$, the distribution $\Lambda_0$ for $\xi$ can be taken as the
normal distribution with an arbitrary mean $\xi_1$ and variance $(\sigma_2^2 - \sigma_0^2)/n$. On $\Omega'$,
let the conditional distribution of $\xi$ given $\sigma = \sigma_2$ assign probability 1 to the value
$\xi_1$, and let the conditional distribution of $\xi$ given $\sigma = \sigma_1$ be $N(\xi_1, (\sigma_2^2 - \sigma_1^2)/n)$.
Finally, let $\Lambda_1$ assign probabilities $p$ and $1 - p$ to $\sigma = \sigma_1$ and $\sigma = \sigma_2$, respectively.
Then the rejection region satisfies (6.24), and any constants $C_1$ and $C_2$ for which
the test has size $\alpha$ can be attained by proper choice of $p$ [Problem 6.36(i)]. ■
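
One way to see why priors of this kind work in the two-sided case is sketched below. This is offered as a reading aid rather than as the solution to Problem 6.36, and it reads the prior variances as $(\sigma_2^2 - \sigma_0^2)/n$ under $\Lambda_0$ and $(\sigma_2^2 - \sigma_1^2)/n$ given $\sigma = \sigma_1$. Given $(\xi, \sigma)$, $\bar{X}$ is $N(\xi, \sigma^2/n)$ and independent of $\sum (X_i - \bar{X})^2$, and mixing over a normal prior for $\xi$ adds the variances, so that the marginal variance of $\bar{X}$ is the same under all three components:

```latex
\begin{aligned}
\sigma = \sigma_0:&\quad \frac{\sigma_0^2}{n} + \frac{\sigma_2^2 - \sigma_0^2}{n} = \frac{\sigma_2^2}{n},\\
\sigma = \sigma_1:&\quad \frac{\sigma_1^2}{n} + \frac{\sigma_2^2 - \sigma_1^2}{n} = \frac{\sigma_2^2}{n},\\
\sigma = \sigma_2:&\quad \frac{\sigma_2^2}{n} + 0 = \frac{\sigma_2^2}{n}.
\end{aligned}
```

In each case $\bar{X}$ is marginally $N(\xi_1, \sigma_2^2/n)$, so its density cancels from $h_1/h_0$, and the ratio depends on the data only through $\sum (x_i - \bar{x})^2$.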

The results of Examples 6.7.13 and 6.7.14 can be used as the basis for proving 
admissibility results in many other situations involving normal distributions. The 
main new difficulty tends to be the presence of additional (nuisance) means. These 
can often be eliminated by use of the following lemma. 


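Lemma 6.7.1 For any given $\sigma^2$ and $M^2 > \sigma^2$ there exists a distribution $\Lambda_\sigma$
such that

$$I(z) = \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2\sigma^2}(z - \zeta)^2 \right] d\Lambda_\sigma(\zeta)$$

is the normal density with mean zero and variance $M^2$.


Proof. Let $\theta = \zeta/\sigma$, and let $\theta$ be normally distributed with zero mean and
variance $\tau^2$. Then it is seen [Problem 6.36(ii)] that

$$I(z) = \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{1 + \tau^2}} \exp\left[ -\frac{1}{2\sigma^2(1 + \tau^2)}\, z^2 \right].$$

The result now follows by letting $\tau^2 = (M^2/\sigma^2) - 1$, so that $\sigma^2(1 + \tau^2) = M^2$. ■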
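
The lemma can be checked numerically. The sketch below approximates $I(z)$ by the trapezoid rule, taking $\Lambda_\sigma$ to be the $N(0, M^2 - \sigma^2)$ distribution of $\zeta = \sigma\theta$ as in the proof, and compares the result with the $N(0, M^2)$ density. The function name, the integration settings, and the particular values of $z$, $\sigma$, and $M$ are assumptions made for illustration.

```python
from statistics import NormalDist


def mixture_density(z, sigma, big_m, steps=4001, width=10.0):
    """Trapezoid approximation of I(z), the N(zeta, sigma^2) density at z
    integrated against zeta ~ N(0, sigma^2 * tau^2), tau^2 = M^2/sigma^2 - 1."""
    tau2 = big_m ** 2 / sigma ** 2 - 1.0
    prior = NormalDist(0.0, sigma * tau2 ** 0.5)
    kernel = NormalDist(0.0, sigma)
    lo = -width * prior.stdev
    h = 2.0 * width * prior.stdev / (steps - 1)
    total = 0.0
    for k in range(steps):
        zeta = lo + k * h
        weight = 0.5 if k in (0, steps - 1) else 1.0
        total += weight * kernel.pdf(z - zeta) * prior.pdf(zeta)
    return total * h


approx = mixture_density(0.7, sigma=1.0, big_m=2.0)
exact = NormalDist(0.0, 2.0).pdf(0.7)
```

For these values `approx` and `exact` agree to high accuracy, as the lemma predicts.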


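Example 6.7.15 Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be samples from $N(\xi, \sigma^2)$ and
$N(\eta, \tau^2)$ respectively, and consider the problem of testing $H : \tau/\sigma = 1$ against
$\tau/\sigma = \Delta > 1$.

(i) Suppose first that $\xi = \eta = 0$. If $\Lambda_0$ and $\Lambda_1$ assign probability 1 to the
points $(\sigma_0, \tau_0 = \sigma_0)$ and $(\sigma_1, \tau_1 = \Delta\sigma_1)$ respectively, the ratio $h_1/h_0$ of Theorem
6.7.2 is proportional to

$$\exp\left\{ -\frac{1}{2}\left[ \left( \frac{1}{\sigma_1^2} - \frac{1}{\sigma_0^2} \right) \sum x_i^2 - \left( \frac{1}{\sigma_0^2} - \frac{1}{\Delta^2\sigma_1^2} \right) \sum y_j^2 \right] \right\},$$

and for suitable choice of critical value and $\sigma_1 < \sigma_0$, the rejection region of the
Bayes test reduces to

$$\frac{\sum y_j^2}{\sum x_i^2} > \frac{\Delta^2\sigma_0^2 - \Delta^2\sigma_1^2}{\Delta^2\sigma_1^2 - \sigma_0^2}.$$

The values $\sigma_0^2$ and $\sigma_1^2$ can then be chosen to give this test any preassigned size $\alpha$.

(ii) If $\xi$ and $\eta$ are unknown, then $\bar{X}$, $\bar{Y}$, $S_X^2 = \sum (X_i - \bar{X})^2$, $S_Y^2 = \sum (Y_j - \bar{Y})^2$
are sufficient statistics, and $S_X^2$ and $S_Y^2$ can be represented as $S_X^2 = \sum_{i=1}^{m-1} U_i^2$,
$S_Y^2 = \sum_{j=1}^{n-1} V_j^2$, with the $U_i$, $V_j$ independent normal with means 0 and variances
$\sigma^2$ and $\tau^2$ respectively.

To $\sigma$ and $\tau$ assign the distributions $\Lambda_0$ and $\Lambda_1$ of part (i) and conditionally,
given $\sigma$ and $\tau$, let $\xi$ and $\eta$ be independently distributed according to $\Lambda_{0\sigma}$, $\Lambda_{0\tau}$ over
$\Omega_H$ and $\Lambda_{1\sigma}$, $\Lambda_{1\tau}$ over $\Omega_K$, with these four conditional distributions determined
from Lemma 6.7.1 in such a way that

$$\int \frac{\sqrt{m}}{\sqrt{2\pi}\,\sigma_0}\, e^{-(m/2\sigma_0^2)(\bar{x} - \xi)^2}\, d\Lambda_{0\sigma_0}(\xi) = \int \frac{\sqrt{m}}{\sqrt{2\pi}\,\sigma_1}\, e^{-(m/2\sigma_1^2)(\bar{x} - \xi)^2}\, d\Lambda_{1\sigma_1}(\xi),$$

and analogously for $\eta$. This is possible by choosing the constant $M^2$ of Lemma
6.7.1 greater than both $\sigma_0^2$ and $\sigma_1^2$. With this choice of priors, the contribution
from $\bar{x}$ and $\bar{y}$ to the ratio $h_1/h_0$ of Theorem 6.7.2 disappears, so that $h_1/h_0$
reduces to the expression for this ratio in part (i), with $\sum x_i^2$ and $\sum y_j^2$ replaced
by $\sum (x_i - \bar{x})^2$ and $\sum (y_j - \bar{y})^2$ respectively. ■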


This approach applies quite generally in normal problems with nuisance means,
provided the prior distribution of the variances $\sigma^2, \tau^2, \ldots$ assigns probability 1
to a bounded set, so that $M^2$ can be chosen to exceed all possible values of these
variances.

Admissibility questions have been considered not only for tests but also for 
confidence sets. These will not be treated here (but see Example 8.5.4); convenient 
entries to the literature are Cohen and Strawderman (1973) and Joshi (1982). For 
additional results, see Hooper (1982b) and Arnold (1984). 


6.8 Rank Tests 

One of the basic problems of statistics is the two-sample problem of testing the 
equality of two distributions. A typical example is the comparison of a treatment 
with a control, where the hypothesis of no treatment effect is tested against 
the alternatives of a beneficial effect. This was considered in Chapter 5 under 
the assumption of normality, and the appropriate test was seen to be based on 
Student’s $t$. It was also shown that when approximate normality is suspected
but the assumption cannot be trusted, one is led to replacing the $t$-test by its
permutation analogue, which in turn can be approximated by the original $t$-test.

We shall consider the same problem below without, at least for the moment,
making any assumptions concerning even the approximate form of the underlying
distributions, assuming only that they are continuous. The observations then
consist of samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from two distributions with
continuous cumulative distribution functions $F$ and $G$, and the problem becomes that
of testing the hypothesis

$$H_1 : G = F.$$

If the treatment effect is assumed to be additive, the alternatives are $G(y) =
F(y - \Delta)$. We shall here consider the more general possibility that the size of the
effect may depend on the value of $y$ (so that $\Delta$ becomes a nonnegative function
of $y$) and therefore test $H_1$ against the one-sided alternatives that the $Y$'s are
stochastically larger than the $X$'s,

$$K_1 : G(z) \le F(z) \ \text{ for all } z, \text{ and } G \ne F.$$

An alternative experiment that can be performed to test the effect of a
treatment consists of the comparison of $N$ pairs of subjects, which have been matched
so as to eliminate as far as possible any differences not due to the treatment.
One member of each pair is chosen at random to receive the treatment while the
other serves as control. If the normality assumption of Section 5.10 is dropped
and the pairs of subjects can be considered to constitute a sample, the
observations $(X_1, Y_1), \ldots, (X_N, Y_N)$ are a sample from a continuous bivariate distribution
$F$. The hypothesis of no effect is then equivalent to the assumption that $F$ is
symmetric with respect to the line $y = x$:

$$H_2 : F(x, y) = F(y, x).$$

Another basic problem, which occurs in many different contexts,
concerns the dependence or independence of two variables. In particular, if
$(X_1, Y_1), \ldots, (X_N, Y_N)$ is a sample from a bivariate distribution $F$, one will be
interested in the hypothesis

$$H_3 : F(x, y) = G_1(x)G_2(y)$$

that $X$ and $Y$ are independent, which was considered for normal distributions in
Section 5.13. The alternatives of interest may, for example, be that $X$ and $Y$ are
positively dependent. An alternative formulation results when $x$, instead of being
random, can be selected for the experiment. If the chosen values are $x_1 < \cdots <
x_N$ and $F_i$ denotes the distribution of $Y$ given $x_i$, the $Y$'s are independently
distributed with continuous cumulative distribution functions $F_1, \ldots, F_N$. The
hypothesis of independence of $Y$ from $x$ becomes

$$H_4 : F_1 = \cdots = F_N,$$

while under the alternatives of positive regression dependence the variables $Y_i$
are stochastically increasing with $i$.

In these and other similar problems, invariance reduces the data so completely 
that the actual values of the observations are discarded and only certain order 
relations between different groups of variables are retained. It is nevertheless 
possible on this basis to test the various hypotheses in question, and the resulting 
tests frequently are nearly as powerful as the standard normal tests. We shall now 
carry out this reduction for the four problems above. 

The two-sample problem of testing $H_1$ against $K_1$ remains invariant under the
group $G$ of all transformations

$$x'_i = \rho(x_i), \quad y'_j = \rho(y_j) \qquad (i = 1, \ldots, m, \; j = 1, \ldots, n)$$

such that $\rho$ is continuous and strictly increasing. This follows from the fact
that these transformations preserve both the continuity of a distribution and
the property of two variables being either identically distributed or one being
stochastically larger than the other. As was seen (with a different notation) in
Example 6.2.3, a maximal invariant under $G$ is the set of ranks

$$(R', S') = (R'_1, \ldots, R'_m; S'_1, \ldots, S'_n)$$

of $X_1, \ldots, X_m; Y_1, \ldots, Y_n$ in the combined sample. Since the distribution of
$(R'_1, \ldots, R'_m; S'_1, \ldots, S'_n)$ is symmetric in the first $m$ and in the last $n$ variables
for all distributions $F$ and $G$, a set of sufficient statistics for $(R', S')$ is the set of
the $X$-ranks and that of the $Y$-ranks without regard to the subscripts of the $X$'s
and $Y$'s. This can be represented by the ordered $X$-ranks and $Y$-ranks

$$R_1 < \cdots < R_m \quad \text{and} \quad S_1 < \cdots < S_n,$$

and therefore by one of these sets alone since each of them determines the other.
Any invariant test is thus a rank test, that is, it depends only on the ranks of the
observations, for example on $(S_1, \ldots, S_n)$.
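
The invariance of the ranks is easy to check numerically. The following Python
sketch is an illustration only: the helper `combined_ranks`, the map `rho`, and the
sample values are assumptions made for the example and do not come from the text.
It computes the ordered X-ranks and Y-ranks in the combined sample before and after
a continuous, strictly increasing transformation is applied to all observations.

```python
def combined_ranks(xs, ys):
    """Return (sorted X-ranks, sorted Y-ranks) in the combined sample.

    Assumes all m + n observations are distinct (continuous case).
    """
    pooled = sorted(xs + ys)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # rank 1 = smallest
    return sorted(rank[v] for v in xs), sorted(rank[v] for v in ys)


def rho(t):
    # An arbitrary continuous, strictly increasing transformation.
    return t ** 3 + t


xs = [0.3, -1.2, 2.5]          # hypothetical X-sample (m = 3)
ys = [1.1, 0.9, 3.4, -0.4]     # hypothetical Y-sample (n = 4)

r, s = combined_ranks(xs, ys)
r_t, s_t = combined_ranks([rho(v) for v in xs], [rho(v) for v in ys])
```

Both calls return the same pair of rank sets, so any statistic computed from the
ordered ranks is unchanged by the transformation.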

That almost invariant tests are equivalent to invariant ones in the present 
context was shown first by Bell (1964). A streamlined and generalized version of 
his approach is given by Berk and Bickel (1968) and Berk (1970), who also show 
that the conclusion of Theorem 6.5.3 remains valid in this case. 

To obtain a similar reduction for $H_2$, it is convenient first to make the
transformation $Z_i = Y_i - X_i$, $W_i = X_i + Y_i$. The pairs of variables $(Z_i, W_i)$ are then
again a sample from a continuous bivariate distribution. Under the hypothesis
this distribution is symmetric with respect to the $w$-axis, while under the
alternatives the distribution is shifted in the direction of the positive $z$-axis. The
problem is unchanged if all the $w$'s are subjected to the same transformation
$w'_i = \lambda(w_i)$, where $\lambda$ is 1 : 1 and has at most a finite number of discontinuities,
and $(Z_1, \ldots, Z_N)$ constitutes a maximal invariant under this group. [Cf. Problem
6.2(ii).]

The $Z$'s are a sample from a continuous univariate distribution $D$, for which
the hypothesis of symmetry with respect to the origin,

$$H'_2 : D(z) + D(-z) = 1 \quad \text{for all } z,$$

is to be tested against the alternatives that the distribution is shifted
toward positive $z$-values. This problem is invariant under the group $G$ of all
transformations

$$z'_i = \rho(z_i) \qquad (i = 1, \ldots, N)$$

such that $\rho$ is continuous, odd, and strictly increasing. If $z_{i_1}, \ldots, z_{i_m} < 0 <
z_{j_1}, \ldots, z_{j_n}$, where $i_1 < \cdots < i_m$ and $j_1 < \cdots < j_n$, let $s'_1, \ldots, s'_n$ denote the
ranks of $z_{j_1}, \ldots, z_{j_n}$ among the absolute values $|z_1|, \ldots, |z_N|$, and $r'_1, \ldots, r'_m$ the
ranks of $|z_{i_1}|, \ldots, |z_{i_m}|$ among $|z_1|, \ldots, |z_N|$. The transformations $\rho$ preserve the
sign of each observation, and hence in particular also the numbers $m$ and $n$.
Since $\rho$ is a continuous, strictly increasing function of $|z|$, it leaves the order of
the absolute values invariant and therefore the ranks $r'_i$ and $s'_j$. To see that the
latter are maximal invariant, let $(z_1, \ldots, z_N)$ and $(z'_1, \ldots, z'_N)$ be two sets of points
with $m' = m$, $n' = n$, and the same $r'_i$ and $s'_j$. There exists a continuous, strictly
increasing function $\rho$ on the positive real axis such that $|z'_i| = \rho(|z_i|)$ and $\rho(0) = 0$.
If $\rho$ is defined for negative $z$ by $\rho(-z) = -\rho(z)$, it belongs to $G$ and $z'_i = \rho(z_i)$
for all $i$, as was to be proved. As in the preceding problem, sufficiency permits
the further reduction to the ordered ranks $r_1 < \cdots < r_m$ and $s_1 < \cdots < s_n$. This
retains the information for the rank of each absolute value whether it belongs
to a positive or negative observation, but not with which positive or negative
observation it is associated.
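
The reduction to signed ranks can be illustrated in the same way. The sketch below
is hypothetical: the function name `signed_rank_summary`, the odd map `rho`, and the
data are assumptions chosen for the illustration. It records, for each rank of an
absolute value, only whether it belongs to a negative or a positive observation.

```python
def signed_rank_summary(zs):
    """Ordered ranks of |z_i|, split by the sign of z_i.

    Returns (r, s): r holds the ranks of the negative observations and
    s the ranks of the positive ones, each rank taken among |z_1|..|z_N|.
    Assumes no zeros and no ties among the absolute values.
    """
    order = sorted(abs(z) for z in zs)
    rank = {a: i + 1 for i, a in enumerate(order)}
    r = sorted(rank[abs(z)] for z in zs if z < 0)
    s = sorted(rank[abs(z)] for z in zs if z > 0)
    return r, s


def rho(z):
    # Continuous, odd, strictly increasing: rho(-z) == -rho(z).
    return z ** 3


zs = [1.5, -0.7, 2.2, -3.1, 0.4]   # hypothetical differences
before = signed_rank_summary(zs)
after = signed_rank_summary([rho(z) for z in zs])
```

The two summaries agree, as the maximal-invariance argument requires.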

The situation is very similar for the hypotheses $H_3$ and $H_4$. The problem
of testing for independence in a bivariate distribution against the alternatives
of positive dependence is unchanged if the $X_i$ and $Y_i$ are subjected to
transformations $X'_i = \rho(X_i)$, $Y'_i = \lambda(Y_i)$ such that $\rho$ and $\lambda$ are continuous and
strictly increasing. This leaves as maximal invariant the ranks $(R'_1, \ldots, R'_N)$ of
$(X_1, \ldots, X_N)$ among the $X$'s and the ranks $(S'_1, \ldots, S'_N)$ of $(Y_1, \ldots, Y_N)$ among
the $Y$'s. The distribution of $(R'_1, S'_1), \ldots, (R'_N, S'_N)$ is symmetric in these $N$ pairs
for all distributions of $(X, Y)$. It follows that a sufficient statistic is $(S_1, \ldots, S_N)$
where $(1, S_1), \ldots, (N, S_N)$ is a permutation of $(R'_1, S'_1), \ldots, (R'_N, S'_N)$ and where
therefore $S_i$ is the rank of the variable $Y$ associated with the $i$th smallest $X$.

The hypothesis $H_4$ that $Y_1, \ldots, Y_n$ constitutes a sample is to be tested against
the alternatives $K_4$ that the $Y_i$ are stochastically increasing with $i$. This problem
is invariant under the group of transformations $y'_i = \rho(y_i)$ where $\rho$ is continuous
and strictly increasing. A maximal invariant under this group is the set of ranks
$S_1, \ldots, S_n$ of $Y_1, \ldots, Y_n$.

Some invariant tests of the hypotheses $H_1$ and $H_2$ will be considered in the next
two sections. Corresponding results concerning $H_3$ and $H_4$ are given in Problems
6.60-6.62.


6.9 The Two-Sample Problem 


The problem of testing the two-sample hypothesis $H : G = F$ against the
one-sided alternatives $K$ that the $Y$'s are stochastically larger than the $X$'s is reduced
by the principle of invariance to the consideration of tests based on the ranks
$S_1 < \cdots < S_n$ of the $Y$'s. The specification of the $S_i$ is equivalent to specifying
for each of the $N = m + n$ positions within the combined sample (the smallest,
the next smallest, etc.) whether it is occupied by an $x$ or a $y$. Since for any set of
observations $n$ of the $N$ positions are occupied by $y$'s and since the $\binom{N}{n}$ possible
assignments of $n$ positions to the $y$'s are all equally likely when $G = F$, the joint
distribution of the $S_i$ under $H$ is

$$P\{S_1 = s_1, \ldots, S_n = s_n\} = 1 \Big/ \binom{N}{n} \tag{6.25}$$

for each set $1 \le s_1 < s_2 < \cdots < s_n \le N$. Any rank test of $H$ of size
$\alpha = k / \binom{N}{n}$ therefore has a rejection region consisting of exactly $k$ points $(s_1, \ldots, s_n)$.
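
Because every rank point has the same null probability, the size of a rank test can
be found by counting. The following sketch enumerates all rank points for small
samples and tabulates the sum of the Y-ranks; the function names and the choice
m = 3, n = 2 are assumptions made for the illustration.

```python
from itertools import combinations
from math import comb
from fractions import Fraction


def rank_sum_null(m, n):
    """Exact null distribution of the sum of the Y-ranks.

    Under H every set of n positions out of N = m + n has
    probability 1 / C(N, n), as in (6.25).
    """
    N = m + n
    p = Fraction(1, comb(N, n))
    dist = {}
    for s in combinations(range(1, N + 1), n):
        w = sum(s)
        dist[w] = dist.get(w, 0) + p
    return dist


def upper_tail(dist, c):
    """P(sum of Y-ranks >= c) under H."""
    return sum(p for w, p in dist.items() if w >= c)


dist = rank_sum_null(3, 2)        # m = 3, n = 2, so C(5, 2) = 10 points
alpha = upper_tail(dist, 8)       # rejection region {sum >= 8}
```

Here the rejection region contains the k = 2 rank points (3, 5) and (4, 5), so the
size is 2/10.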

For testing $H$ against $K$ there exists no UMP rank test, and hence no UMP
invariant test. This follows for example from a consideration of two of the standard
tests for this problem, since each is most powerful among all rank tests against
some alternative. The two tests in question have rejection regions of the form

$$h(s_1) + \cdots + h(s_n) > C. \tag{6.26}$$

One, the Wilcoxon two-sample test, is obtained from (6.26) by letting $h(s) = s$,
so that it rejects $H$ when the sum of the $y$-ranks is too large. We shall show below
that for sufficiently small $\Delta$, this is most powerful against the alternatives that
$F$ is the logistic distribution $F(x) = 1/(1 + e^{-x})$, and that $G(y) = F(y - \Delta)$. The
other test, the normal-scores test, has the rejection region (6.26) with $h(s) =
E(W_{(s)})$, where $W_{(1)} < \cdots < W_{(N)}$ is an ordered sample of size $N$ from a
standard normal distribution.$^5$ This is most powerful against the alternatives
that $F$ and $G$ are normal distributions with common variance and means $\xi$ and
$\eta = \xi + \Delta$, when $\Delta$ is sufficiently small.

To prove that these tests have the stated properties it is necessary to know
the distribution of $(S_1, \ldots, S_n)$ under the alternatives. If $F$ and $G$ have densities
$f$ and $g$ such that $f$ is positive whenever $g$ is, the joint distribution of the $S_i$ is
given by

$$P\{S_1 = s_1, \ldots, S_n = s_n\} = E\left[\frac{g(V_{(s_1)})}{f(V_{(s_1)})} \cdots \frac{g(V_{(s_n)})}{f(V_{(s_n)})}\right] \Big/ \binom{N}{n}, \tag{6.27}$$

where $V_{(1)} < \cdots < V_{(N)}$ is an ordered sample of size $N$ from the distribution $F$.
(See Problem 6.42.) Consider in particular the translation (or shift) alternatives 


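$$g(y) = f(y - \Delta),$$

and the problem of maximizing the power for small values of $\Delta$. Suppose that $f$
is differentiable and that the probability (6.27), which is now a function of $\Delta$, can
be differentiated with respect to $\Delta$ under the expectation sign. The derivative of
(6.27) at $\Delta = 0$ is then

$$\frac{\partial}{\partial\Delta} P_\Delta\{S_1 = s_1, \ldots, S_n = s_n\}\Big|_{\Delta=0} = -\frac{1}{\binom{N}{n}} \sum_{i=1}^{n} E\left[\frac{f'(V_{(s_i)})}{f(V_{(s_i)})}\right].$$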



Since under the hypothesis the probability of any ranking is given by (6.25), it
follows from the Neyman-Pearson lemma in the extended form of Theorem 3.6.1,
that the derivative of the power function at $\Delta = 0$ is maximized by the rejection
region

$$-\sum_{i=1}^{n} E\left[\frac{f'(V_{(s_i)})}{f(V_{(s_i)})}\right] > C. \tag{6.28}$$

The same test maximizes the power itself for sufficiently small $\Delta$. To see this
let $s$ denote a general rank point $(s_1, \ldots, s_n)$, and denote by $s^{(j)}$ the rank point
giving the $j$th largest value to the left-hand side of (6.28). If

$$\alpha = k \Big/ \binom{N}{n},$$

the power of the test is then

$$\beta(\Delta) = \sum_{j=1}^{k} P_\Delta(s^{(j)}) = \sum_{j=1}^{k} \left[\frac{1}{\binom{N}{n}} + \Delta \, \frac{\partial}{\partial\Delta} P_\Delta(s^{(j)})\Big|_{\Delta=0} + \cdots \right].$$

Since there is only a finite number of points $s$, there exists for each $j$ a number
$\Delta_j > 0$ such that the point $s^{(j)}$ also gives the $j$th largest value to $P_\Delta(s)$ for all
$\Delta < \Delta_j$. If $\Delta$ is less than the smallest of the numbers

$$\Delta_j, \qquad j = 1, \ldots, \binom{N}{n},$$

the test also maximizes $\beta(\Delta)$.

$^5$ Tables of the expected order statistics from a normal distribution are given in
Biometrika Tables for Statisticians, Vol. 2, Cambridge U. P., 1972, Table 9. For
additional references, see David (1981, Appendix, Section 3.2).

If $f(x)$ is the normal density $N(\xi, \sigma^2)$, then

$$-\frac{f'(x)}{f(x)} = -\frac{d}{dx} \log f(x) = \frac{x - \xi}{\sigma^2},$$

and the left-hand side of (6.28) becomes

$$\sum E\left[\frac{V_{(s_i)} - \xi}{\sigma^2}\right] = \frac{1}{\sigma} \sum E(W_{(s_i)}),$$

where $W_{(1)} < \cdots < W_{(N)}$ is an ordered sample from $N(0, 1)$. The test that
maximizes the power against these alternatives (for sufficiently small $\Delta$) is therefore
the normal-scores test.
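
The normal scores are usually taken from tables, but they can also be approximated
numerically. The sketch below is one possible approach and not the method behind
the published tables: it integrates the normal quantile function against the density
of the uniform order statistic by a midpoint rule. The function names, the grid size,
and the example ranks are assumptions made for the illustration.

```python
from math import comb
from statistics import NormalDist


def normal_scores(N, grid=20000):
    """Approximate h(s) = E(W_(s)), s = 1..N, for a N(0,1) sample of size N.

    Uses E(W_(s)) = integral over (0,1) of Phi^{-1}(u) * b_s(u) du, where
    b_s is the Beta(s, N - s + 1) density of the s-th uniform order
    statistic; the integral is approximated by a midpoint rule.
    """
    inv = NormalDist().inv_cdf
    us = [(k + 0.5) / grid for k in range(grid)]
    qs = [inv(u) for u in us]
    scores = []
    for s in range(1, N + 1):
        c = s * comb(N, s)  # equals N! / ((s-1)! (N-s)!)
        total = sum(q * c * u ** (s - 1) * (1 - u) ** (N - s)
                    for u, q in zip(us, qs))
        scores.append(total / grid)
    return scores


def score_statistic(y_ranks, h):
    """Left-hand side of (6.26); h is a list with h[s - 1] = h(s)."""
    return sum(h[s - 1] for s in y_ranks)


h = normal_scores(5)
stat = score_statistic([2, 4, 5], h)   # hypothetical Y-ranks with N = 5
```

The scores come out increasing and antisymmetric about the middle rank, which is
the property h(s) + h(N + 1 - s) = constant used for the two-sided tests later in this
section.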

In the case of the logistic distribution,

$$F(x) = \frac{1}{1 + e^{-x}}, \qquad f(x) = \frac{e^{-x}}{(1 + e^{-x})^2},$$

and hence

$$-\frac{f'(x)}{f(x)} = 2F(x) - 1.$$

The locally most powerful rank test therefore rejects when $\sum E[F(V_{(s_i)})] > C$.
If $V$ has the distribution $F$, then $U = F(V)$ is uniformly distributed over $(0, 1)$
(Problem 3.22). The rejection region can therefore be written as $\sum E(U_{(s_i)}) > C$,
where $U_{(1)} < \cdots < U_{(N)}$ is an ordered sample of size $N$ from the uniform
distribution $U(0, 1)$. Since $E(U_{(s_i)}) = s_i/(N + 1)$, the test is seen to be the
Wilcoxon test.
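
Both facts used in this argument can be verified directly. The following sketch
evaluates the expectation of a uniform order statistic exactly from its Beta density
and compares the logistic score function with 2F(x) - 1 by a finite difference. The
function names and the evaluation points are assumptions made for the illustration.

```python
from fractions import Fraction
from math import exp, factorial


def expected_uniform_order_stat(s, N):
    """Exact E(U_(s)) for a U(0,1) sample of size N.

    U_(s) has the Beta(s, N - s + 1) density, so the expectation is
    N!/((s-1)!(N-s)!) * B(s + 1, N - s + 1), evaluated with factorials.
    """
    const = Fraction(factorial(N), factorial(s - 1) * factorial(N - s))
    beta = Fraction(factorial(s) * factorial(N - s), factorial(N + 1))
    return const * beta


def logistic_score_gap(x, eps=1e-6):
    """|(-f'(x)/f(x)) - (2F(x) - 1)| with f' taken by central differences."""
    F = lambda t: 1.0 / (1.0 + exp(-t))
    f = lambda t: exp(-t) / (1.0 + exp(-t)) ** 2
    fprime = (f(x + eps) - f(x - eps)) / (2 * eps)
    return abs(-fprime / f(x) - (2 * F(x) - 1))


N = 7
scores = [expected_uniform_order_stat(s, N) for s in range(1, N + 1)]
gaps = [logistic_score_gap(x) for x in (-2.0, 0.0, 0.5, 3.0)]
```

Since the scores are s/(N + 1), an increasing linear function of s, the rejection
region is equivalent to one based on the sum of the ranks.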

Both the normal-scores test and the Wilcoxon test are unbiased against the
one-sided alternatives $K$. In fact, let $\phi$ be the critical function of any test
determined by (6.26) with $h$ nondecreasing. Then $\phi$ is nondecreasing in the $y$'s and the
probability of rejection is $\alpha$ for all $F = G$. By Lemma 5.9.1 the test is therefore
unbiased against all alternatives of $K$.

It follows from the unbiasedness properties of these tests that the most
powerful invariant tests in the two cases considered are also most powerful against
their respective alternatives among all tests that are invariant and unbiased. The
nonexistence of a UMP test is thus not relieved by restricting the tests to be
unbiased as well as invariant. Nor does the application of the unbiasedness principle
alone lead to a solution, as was seen in the discussion of permutation tests in
Section 5.9. With the failure of these two principles, both singly and in conjunction,
the problem is left not only without a solution but even without a formulation.
A possible formulation (stringency) will be discussed in Chapter 8. However, the
determination of a most stringent test for the two-sample hypothesis is an open
problem.

For testing $H : G = F$ against the two-sided alternatives that the $Y$'s are either
stochastically smaller or larger than the $X$'s, two-sided versions of the rank tests
of this section can be used. In particular, suppose that $h$ is increasing and that
$h(s) + h(N + 1 - s)$ is independent of $s$, as is the case for the Wilcoxon and
normal-scores statistics. Then under $H$, the statistic $\sum h(s_j)$ is symmetrically distributed
about $n \sum_{i=1}^{N} h(i)/N = \mu$, and (6.26) suggests the rejection region

$$\left| \sum h(s_j) - \mu \right| = \frac{1}{N} \left| m \sum_{j=1}^{n} h(s_j) - n \sum_{i=1}^{m} h(r_i) \right| > C.$$


The theory here is still less satisfactory than in the one-sided case. These tests 
need not even be unbiased [Sugiura (1965)], and it is not known whether they 
are admissible within the class of all rank tests. On the other hand, the relative 
asymptotic efficiencies are the same as in the one-sided case. 

The two-sample hypothesis $G = F$ can also be tested against the general
alternatives $G \ne F$. This problem arises in deciding whether two products, two
sets of data, or the like can be pooled when nothing is known about the underlying
distributions. Since the alternatives are now unrestricted, the problem remains
invariant under all transformations $x'_i = f(x_i)$, $y'_j = f(y_j)$, $i = 1, \ldots, m$, $j =
1, \ldots, n$, such that $f$ has only a finite number of discontinuities. There are no
invariants under this group, so that the only invariant test is $\phi(x, y) \equiv \alpha$. This is
however not admissible, since there do exist tests of $H$ that are strictly unbiased
against all alternatives $G \ne F$ (Problem 6.54). One of the tests most commonly
employed for this problem is the Smirnov test. Let the empirical distribution
functions of the two samples be defined by

$$S_{x_1, \ldots, x_m}(z) = \frac{a}{m}, \qquad S_{y_1, \ldots, y_n}(z) = \frac{b}{n},$$

where $a$ and $b$ are the numbers of $x$'s and $y$'s less or equal to $z$ respectively. Then
$H$ is rejected according to this test when

$$\sup_z \left| S_{x_1, \ldots, x_m}(z) - S_{y_1, \ldots, y_n}(z) \right| > C.$$

Accounts of the theory of this and related tests are given, for example, in Durbin 
(1973), Serfling (1980), Gibbons and Chakraborti (1992) and Hajek, Sidak, and 
Sen (1999). 
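
The Smirnov statistic itself is simple to compute, since both empirical distribution
functions are step functions that change only at observed values. The sketch below
is a direct, unoptimized illustration; the function name and the data are assumptions
made for the example, and the critical value C is not computed.

```python
def smirnov_statistic(xs, ys):
    """sup_z |S_x(z) - S_y(z)| for the two empirical distribution functions.

    The supremum is attained at one of the observed values, so it is
    enough to evaluate both step functions there.
    """
    m, n = len(xs), len(ys)
    best = 0.0
    for z in sorted(set(xs) | set(ys)):
        a = sum(1 for x in xs if x <= z)   # number of x's <= z
        b = sum(1 for y in ys if y <= z)   # number of y's <= z
        best = max(best, abs(a / m - b / n))
    return best


xs = [0.1, 0.4, 0.7]              # hypothetical first sample
ys = [0.5, 0.8, 0.9, 1.2]         # hypothetical second sample
d = smirnov_statistic(xs, ys)
```

For these data the largest gap occurs at z = 0.7, where all three x's but only one
of the four y's lie at or below z.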

Two-sample rank tests are distribution-free for testing $H : G = F$ but not for
the nonparametric Behrens-Fisher situation of testing $H : \eta = \xi$ when the $X$'s
and $Y$'s are samples from $F((x - \xi)/\sigma)$ and $F((y - \eta)/\tau)$ with $\sigma$, $\tau$ unknown. A
detailed study of the effect of the difference in scales on the levels of the Wilcoxon 
and normal-scores tests is provided by Pratt (1964). 


6.10 The Hypothesis of Symmetry


When the method of paired comparisons is used to test the hypothesis of no
treatment effect, the problem was seen in Section 6.8 to reduce through invariance
to that of testing the hypothesis

$$H'_2 : D(z) + D(-z) = 1 \quad \text{for all } z,$$

which states that the distribution $D$ of the differences $Z_i = Y_i - X_i$ ($i = 1, \ldots, N$)
is symmetric with respect to the origin. The distribution $D$ can be specified by
the triple $(p, F, G)$ where

$$p = P\{Z \le 0\}, \qquad F(z) = P\{|Z| \le z \mid Z \le 0\}, \qquad G(z) = P\{Z \le z \mid Z \ge 0\},$$

and the hypothesis of symmetry with respect to the origin then becomes

$$H : p = \tfrac{1}{2}, \quad G = F.$$

Invariance and sufficiency were shown to reduce the data to the ranks $S_1 <
\cdots < S_n$ of the positive $Z$'s among the absolute values $|Z_1|, \ldots, |Z_N|$. The
probability of $S_1 = s_1, \ldots, S_n = s_n$ is the probability of this event given that there are
$n$ positive observations multiplied by the probability that the number of positive
observations is $n$. Hence

$$P\{S_1 = s_1, \ldots, S_n = s_n\} = \binom{N}{n} (1 - p)^n p^{N-n} \, P_{F,G}\{S_1 = s_1, \ldots, S_n = s_n \mid n\},$$

where the second factor is given by (6.27). Under $H$, this becomes

$$P\{S_1 = s_1, \ldots, S_n = s_n\} = \frac{1}{2^N}$$

for each of the

$$\sum_{n=0}^{N} \binom{N}{n} = 2^N$$

$n$-tuples $(s_1, \ldots, s_n)$ satisfying $1 \le s_1 < \cdots < s_n \le N$. Any rank test of
size $\alpha = k/2^N$ therefore has a rejection region containing exactly $k$ such points
$(s_1, \ldots, s_n)$.
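
As in the two-sample case, the size of a rank test of symmetry can be obtained by
enumeration. The sketch below lists all sets of positive ranks for a small N and
tabulates the sum of the positive ranks; the function name and the choice N = 4 are
assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations


def signed_rank_null(N):
    """Exact null distribution of the sum of the positive ranks.

    Under H each of the 2^N possible sets of positive ranks
    (s_1 < ... < s_n, n = 0..N) has probability 1 / 2^N.
    """
    p = Fraction(1, 2 ** N)
    dist = {}
    for n in range(N + 1):
        for s in combinations(range(1, N + 1), n):
            w = sum(s)                      # h(s) = s
            dist[w] = dist.get(w, 0) + p
    return dist


dist = signed_rank_null(4)
alpha = sum(p for w, p in dist.items() if w >= 9)   # region {sum >= 9}
```

The region contains the k = 2 points (2, 3, 4) and (1, 2, 3, 4), giving size 2/16.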

The alternatives K of a beneficial treatment effect are characterized by the 
fact that the variable Z being sampled is stochastically larger than some random 
variable which is symmetrically distributed about 0. It is again suggestive to 
use rejection regions of the form $h(s_1) + \cdots + h(s_n) > C$, where however $n$ is
no longer a constant as it was in the two-sample problem, but depends on the
observations. Two particular cases are the Wilcoxon one-sample test, which is
obtained by putting $h(s) = s$, and the analogue of the normal-scores test with
$h(s) = E(W_{(s)})$ where $W_{(1)} < \cdots < W_{(N)}$ are the ordered values of $|V_1|, \ldots, |V_N|$,
the $V$'s being a sample from $N(0, 1)$. The $W$'s are therefore an ordered sample
of size $N$ from a distribution with density $\sqrt{2/\pi}\, e^{-w^2/2}$ for $w \ge 0$.

As in the two-sample problem, it can be shown that each of these tests is most
powerful (among all invariant tests) against certain alternatives, and that they
are both unbiased against the class $K$. Their asymptotic efficiencies relative to
the $t$-test for testing that the mean of $Z$ is zero have the same values $3/\pi$ and 1
as the corresponding two-sample tests, when the distribution of $Z$ is normal.

In certain applications, for example when the various comparisons are made
under different experimental conditions or by different methods, it may be
unrealistic to assume that the variables $Z_1, \ldots, Z_N$ have a common distribution.
Suppose instead that the $Z_i$ are still independently distributed but with
arbitrary continuous distributions $D_i$. The hypothesis to be tested is that each of
these distributions is symmetric with respect to the origin.

This problem remains invariant under all transformations $z'_i = f_i(z_i)$, $i =
1, \ldots, N$, such that each $f_i$ is continuous, odd, and strictly increasing. A
maximal invariant is then the number $n$ of positive observations, and it follows from
Example 6.5.1 that there exists a UMP invariant test, the sign test, which rejects
when $n$ is too large. This test reflects the fact that the magnitude of the
observations or of their absolute values can be explained entirely in terms of the spread
of the distributions $D_i$, so that only the signs of the $Z$'s are relevant.
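
Since each observation is positive with probability 1/2 under the hypothesis, the
number of positive observations is binomial and the sign test needs no tables. A
minimal sketch follows; the function name and the data are assumptions made for
the illustration, and zeros are assumed not to occur.

```python
from fractions import Fraction
from math import comb


def sign_test_p_value(zs):
    """One-sided p-value of the sign test: P(n_plus >= observed) under H.

    Under H each sign is + with probability 1/2 independently, so the
    number of positive observations is binomial(N, 1/2). Assumes no zeros.
    """
    N = len(zs)
    n_pos = sum(1 for z in zs if z > 0)
    tail = sum(comb(N, k) for k in range(n_pos, N + 1))
    return Fraction(tail, 2 ** N)


zs = [0.8, 1.9, -0.3, 2.4, 0.6, 1.1]   # hypothetical differences
p = sign_test_p_value(zs)
```

With five positive values out of six, the tail probability is (6 + 1)/64.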

Frequently, it seems reasonable to assume that the $Z$'s are identically
distributed, but the assumption cannot be trusted. One would then prefer to use
the information provided by the ranks $s_i$ but require a test which controls the
probability of false rejection even when the assumption fails. As is shown by the 
following lemma, this requirement is in fact satisfied for every (symmetric) rank 
test. Actually, the lemma will not require even the independence of the Z’s; it 
will show that any symmetric rank test continues to correspond to the stated 
level of significance provided only the treatment is assigned at random within 
each pair. 


Lemma 6.10.1 Let $\phi(z_1, \ldots, z_N)$ be symmetric in its $N$ variables and such that

$$E_D \phi(Z_1, \ldots, Z_N) = \alpha \tag{6.29}$$

when the $Z$'s are a sample from any continuous distribution $D$ which is symmetric
with respect to the origin. Then

$$E \phi(Z_1, \ldots, Z_N) = \alpha \tag{6.30}$$

if the joint distribution of the $Z$'s is unchanged under the $2^N$ transformations
$Z'_1 = \pm Z_1, \ldots, Z'_N = \pm Z_N$.


Proof. The condition (6.29) implies

$$\sum_{(j_1, \ldots, j_N)} \sum \frac{\phi(\pm z_{j_1}, \ldots, \pm z_{j_N})}{2^N \cdot N!} = \alpha \quad \text{a.e.}, \tag{6.31}$$

where the outer summation extends over all $N!$ permutations $(j_1, \ldots, j_N)$ and
the inner one over all $2^N$ possible choices of the signs $+$ and $-$. This is proved
exactly as was Theorem 5.8.1. If in addition $\phi$ is symmetric, (6.31) implies

$$\sum \frac{\phi(\pm z_1, \ldots, \pm z_N)}{2^N} = \alpha. \tag{6.32}$$

Suppose that the distribution of the $Z$'s is invariant under the $2^N$
transformations in question. Then the conditional probability of any sign combination of
$Z_1, \ldots, Z_N$ given $|Z_1|, \ldots, |Z_N|$ is $1/2^N$. Hence (6.32) is equivalent to

$$E[\phi(Z_1, \ldots, Z_N) \mid |Z_1|, \ldots, |Z_N|] = \alpha \quad \text{a.e.}, \tag{6.33}$$

and this implies (6.30) which was to be proved. ■

The tests discussed above can be used to test symmetry about any known
value $\theta_0$ by applying them to the variables $Z_i - \theta_0$. The more difficult problem
of testing for symmetry about an unknown point $\theta$ will not be considered here.
Tests of this hypothesis are discussed, among others, by Antille, Kersting, and 
Zucchini (1982), Bhattacharya, Gastwirth, and Wright (1982), Boos (1982), and 
Koziol (1983). 

As will be seen in Section 11.3.1, the one-sample $t$-test is not robust against
dependence. Unfortunately, this is also true, although to a somewhat lesser
extent, of the sign and one-sample Wilcoxon tests [Gastwirth and Rubin (1971)].


6.11 Equivariant Confidence Sets 

Confidence sets for a parameter $\theta$ in the presence of nuisance parameters $\vartheta$ were
discussed in Chapter 5 (Sections 5.4 and 5.5) under the assumption that $\theta$ is
real-valued. The correspondence between acceptance regions $A(\theta_0)$ of the hypotheses
$H(\theta_0) : \theta = \theta_0$ and confidence sets $S(x)$ for $\theta$ given by (5.33) and (5.34) is,
however, independent of this assumption; it is valid regardless of whether $\theta$ is
real-valued, vector-valued, or possibly a label for a completely unknown distribution
function (in the latter case, confidence intervals become confidence bands for the
distribution function). This correspondence, which can be summarized by the
relationship

$$\theta \in S(x) \quad \text{if and only if} \quad x \in A(\theta), \tag{6.34}$$

was the basis for deriving uniformly most accurate and uniformly most accurate 
unbiased confidence sets. In the present section, it will be used to obtain uniformly 
most accurate equivariant confidence sets. 

We begin by defining equivariance for confidence sets. Let $G$ be a group
of transformations of the variable $X$ preserving the family of distributions
$\{P_{\theta,\vartheta}, (\theta, \vartheta) \in \Omega\}$ and let $\bar{G}$ be the induced group of transformations of $\Omega$. If
$\bar{g}(\theta, \vartheta) = (\theta', \vartheta')$, we shall suppose that $\theta'$ depends only on $\bar{g}$ and $\theta$ and not on
$\vartheta$, so that $\bar{g}$ induces a transformation in the space of $\theta$. In order to keep the
notation from becoming unnecessarily complex, it will then be convenient to write
also $\theta' = \bar{g}\theta$. For each transformation $g \in G$, denote by $g^*$ the transformation
acting on sets $S$ in $\theta$-space and defined by

$$g^* S = \{\bar{g}\theta : \theta \in S\}, \tag{6.35}$$

so that $g^* S$ is the set obtained by applying the transformation $\bar{g}$ to each point $\theta$ of
$S$. The invariance argument of Section 1.5 then suggests restricting consideration
to confidence sets satisfying

$$g^* S(x) = S(gx) \quad \text{for all } x \in \mathcal{X}, \; g \in G. \tag{6.36}$$

We shall say that such confidence sets are equivariant under $G$. This terminology
is preferable to the older term invariance which creates the impression that the
confidence sets remain unchanged under the transformation $X' = gX$. If the
transformation $g$ is interpreted as a change of coordinates, (6.36) means that
the confidence statement does not depend on the coordinate system used to
express the data. The statement that the transformed parameter $\bar{g}\theta$ lies in $S(gx)$
is equivalent to stating that $\theta \in g^{*-1} S(gx)$, which is equivalent to the original
statement $\theta \in S(x)$ provided (6.36) holds.

Example 6.11.1 Let $X$, $Y$ be independently normally distributed with means
$\xi$, $\eta$ and unit variance, and let $G$ be the group of all rigid motions of the plane,
which is generated by all translations and orthogonal transformations. Here $\bar{g} = g$
for all $g \in G$. An example of an equivariant class of confidence sets is given by

$$S(x, y) = \{(\xi, \eta) : (x - \xi)^2 + (y - \eta)^2 \le C\},$$

the class of circles with radius $\sqrt{C}$ and center $(x, y)$. The set $g^* S(x, y)$ is the
set of all points $g(\xi, \eta)$ with $(\xi, \eta) \in S(x, y)$ and hence is obtained by subjecting
$S(x, y)$ to the rigid motion $g$. The result is the circle with radius $\sqrt{C}$ and center
$g(x, y)$, and (6.36) is therefore satisfied. ■
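
The equivariance condition (6.36) in this example can be checked point by point. In
the sketch below a rigid motion is represented as a rotation followed by a
translation; the helper names, the particular motion, and the test points are
assumptions made for the illustration.

```python
from math import cos, sin


def rigid_motion(theta, shift):
    """Return g: rotation by theta followed by translation by shift."""
    def g(point):
        u, v = point
        return (cos(theta) * u - sin(theta) * v + shift[0],
                sin(theta) * u + cos(theta) * v + shift[1])
    return g


def in_S(center, param, C):
    """True when param = (xi, eta) lies in the disc of radius sqrt(C) about center."""
    return (center[0] - param[0]) ** 2 + (center[1] - param[1]) ** 2 <= C


g = rigid_motion(0.7, (2.0, -1.0))      # hypothetical g
obs = (0.5, 1.5)                         # hypothetical observation (x, y)
params = [(0.4, 1.0), (2.6, 1.5), (0.5, 3.4)]
C = 4.0

# (6.36): g(xi, eta) is in S(g(x, y)) exactly when (xi, eta) is in S(x, y).
agree = [in_S(obs, p, C) == in_S(g(obs), g(p), C) for p in params]
```

The test points are chosen away from the boundary of the disc so that rounding
error cannot change the outcome of the comparison.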

In accordance with the definitions given in Chapters 3 and 5, a class of
confidence sets for $\theta$ will be said to be uniformly most accurate equivariant at
confidence level $1 - \alpha$ if among all equivariant classes of sets $S(x)$ at that level it
minimizes the probability

$$P_{\theta,\vartheta}\{\theta' \in S(X)\} \quad \text{for all } \theta' \ne \theta.$$

In order to derive confidence sets with this property from families of UMP
invariant tests, we shall now investigate the relationship between equivariance of
confidence sets and invariance of the associated tests.

Suppose that for each $\theta_0$ there exists a group of transformations $G_{\theta_0}$ which
leaves invariant the problem of testing $H(\theta_0) : \theta = \theta_0$, and denote by $G$ the group
of transformations generated by the totality of groups $G_\theta$.

Lemma 6.11.1 (i) Let $S(x)$ be any class of confidence sets that is equivariant
under $G$, and let $A(\theta) = \{x : \theta \in S(x)\}$; then the acceptance region $A(\theta)$ is
invariant under $G_\theta$ for each $\theta$.

(ii) If in addition, for each $\theta_0$ the acceptance region $A(\theta_0)$ is UMP invariant
for testing $H(\theta_0)$ at level $\alpha$, the class of confidence sets $S(x)$ is uniformly most
accurate among all equivariant confidence sets at confidence level $1 - \alpha$.

Proof. (i): Consider any fixed $\theta$, and let $g \in G_\theta$. Then

$$gA(\theta) = \{gx : \theta \in S(x)\} = \{x : \theta \in S(g^{-1}x)\} = \{x : \theta \in g^{*-1} S(x)\}$$

$$= \{x : \bar{g}\theta \in S(x)\} = \{x : \theta \in S(x)\} = A(\theta).$$

Here the third equality holds because $S(x)$ is equivariant, and the fifth one
because $g \in G_\theta$ and therefore $\bar{g}\theta = \theta$.

(ii): If $S'(x)$ is any other equivariant class of confidence sets at the prescribed
level, the associated acceptance regions $A'(\theta)$ by (i) define invariant tests of the
hypotheses $H(\theta)$. It follows that these tests are uniformly at most as powerful as
those with acceptance regions $A(\theta)$ and hence that

$$P_{\theta,\vartheta}\{\theta' \in S(X)\} \le P_{\theta,\vartheta}\{\theta' \in S'(X)\} \quad \text{for all } \theta' \ne \theta,$$

as was to be proved. ■

It is an immediate consequence of the lemma that if UMP invariant acceptance
regions $A(\theta)$ have been found for each hypothesis $H(\theta)$ (invariant with respect
to $G_\theta$), and if the confidence sets $S(x) = \{\theta : x \in A(\theta)\}$ are equivariant under $G$,
then they are uniformly most accurate equivariant.

Example 6.11.2 Under the assumptions of Example 6.11.1, the problem of
testing $\xi = \xi_0$, $\eta = \eta_0$ is invariant under the group $G_{\xi_0,\eta_0}$ of orthogonal
transformations about the point $(\xi_0, \eta_0)$:

$$X' - \xi_0 = a_{11}(X - \xi_0) + a_{12}(Y - \eta_0),$$

$$Y' - \eta_0 = a_{21}(X - \xi_0) + a_{22}(Y - \eta_0),$$

where the matrix $(a_{ij})$ is orthogonal. There exists under this group a UMP
invariant test, which has the acceptance region (Problem 7.8)

$$(X - \xi_0)^2 + (Y - \eta_0)^2 \le C.$$

Let $G_0$ be the smallest group containing the groups $G_{\xi,\eta}$ for all $\xi$, $\eta$. Since this is a
subgroup of the group $G$ of Example 6.11.1 (the two groups actually coincide, but
this is immaterial for the argument), the confidence sets $(X - \xi)^2 + (Y - \eta)^2 \le C$
are equivariant under $G_0$ and hence uniformly most accurate equivariant. ■


Example 6.11.3 Let $X_1, \ldots, X_n$ be independently normally distributed with
mean $\xi$ and variance $\sigma^2$. Confidence intervals for $\xi$ are based on the hypotheses
$H(\xi_0) : \xi = \xi_0$, which are invariant under the groups $G_{\xi_0}$ of transformations
$X'_i = a(X_i - \xi_0) + \xi_0$ ($a \ne 0$). The UMP invariant test of $H(\xi_0)$ has acceptance
region

$$\frac{\sqrt{(n - 1)n}\, |\bar{X} - \xi_0|}{\sqrt{\sum (X_i - \bar{X})^2}} \le C,$$

and the associated confidence intervals are

$$\bar{X} - \frac{C}{\sqrt{n(n - 1)}} \sqrt{\sum (X_i - \bar{X})^2} \le \xi \le \bar{X} + \frac{C}{\sqrt{n(n - 1)}} \sqrt{\sum (X_i - \bar{X})^2}. \tag{6.37}$$

The group $G$ in the present case consists of all transformations $g : X'_i = aX_i +
b$ ($a \ne 0$), which on $\xi$ induces the transformation $\bar{g} : \xi' = a\xi + b$. Application
of the associated transformation $g^*$ to the interval (6.37) takes it into the set of
points $a\xi + b$ for which $\xi$ satisfies (6.37), that is, into the interval with end points

$$a\bar{X} + b - \frac{|a| C}{\sqrt{n(n - 1)}} \sqrt{\sum (X_i - \bar{X})^2}, \qquad a\bar{X} + b + \frac{|a| C}{\sqrt{n(n - 1)}} \sqrt{\sum (X_i - \bar{X})^2}.$$

Since this coincides with the interval obtained by replacing $X_i$ in (6.37) with
$aX_i + b$, the confidence intervals (6.37) are equivariant under $G_0$ and hence
uniformly most accurate equivariant. ■
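
The coincidence of the two intervals can be confirmed numerically. In the following
sketch the function name `interval`, the data, the constant C, and the coefficients
a and b are all assumptions made for the illustration; a negative a is used so that
the role of |a| in the end points is visible.

```python
from math import sqrt


def interval(xs, C):
    """End points of (6.37): mean -/+ C * sqrt(sum (x_i - mean)^2 / (n (n - 1)))."""
    n = len(xs)
    mean = sum(xs) / n
    half = C * sqrt(sum((x - mean) ** 2 for x in xs) / (n * (n - 1)))
    return mean - half, mean + half


xs = [4.1, 5.3, 3.8, 6.0, 5.1]     # hypothetical sample
C = 2.776                           # hypothetical critical value
a, b = -2.0, 3.0                    # transformation x -> a x + b with a != 0

lo, hi = interval(xs, C)
# g* applied to the interval: the image of [lo, hi] under xi -> a xi + b.
image = tuple(sorted((a * lo + b, a * hi + b)))
# The interval computed from the transformed data a x_i + b.
direct = interval([a * x + b for x in xs], C)
```

Up to rounding error, `image` and `direct` have the same end points.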


Example 6.11.4 In the two-sample problem of Section 6.9, assume the shift
model in which the $X$'s and $Y$'s have densities $f(x)$ and $g(y) = f(y - \Delta)$
respectively, and consider the problem of obtaining confidence intervals for the shift
parameter $\Delta$ which are distribution-free in the sense that the coverage
probability is independent of the true $f$. The hypothesis $H(\Delta_0) : \Delta = \Delta_0$ can be
tested, for example, by means of the Wilcoxon test applied to the observations
$X_i$, $Y_j - \Delta_0$, and confidence sets for $\Delta$ can then be obtained by the usual inversion
process. The resulting confidence intervals are of the form $D_{(k)} < \Delta < D_{(mn+1-k)}$
where $D_{(1)} < \cdots < D_{(mn)}$ are the $mn$ ordered differences $Y_j - X_i$. [For details see
Problem 6.52 and for fuller accounts nonparametric books such as Randles and
Wolfe (1979), Gibbons and Chakraborti (1992) and Lehmann (1998).] By their
construction, these intervals have coverage probability $1 - \alpha$, which is
independent of $f$. However, the invariance considerations of Sections 6.8 and 6.9 do not
apply. The hypothesis $H(\Delta_0)$ is invariant under the transformations $X'_i = \rho(X_i)$,
$Y'_j = \rho(Y_j - \Delta_0) + \Delta_0$ with $\rho$ continuous and strictly increasing, but the shift
model, and hence the problem under consideration, is not invariant under these
transformations. ■
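
The form of these intervals is easy to illustrate once k is given. The sketch below
takes k as an input rather than deriving it from the null distribution of the Wilcoxon
statistic; the function name and the data are assumptions made for the example.

```python
def shift_interval(xs, ys, k):
    """Interval (D_(k), D_(mn+1-k)) from the mn ordered differences y_j - x_i.

    k is taken as given here; in practice it would come from the null
    distribution of the Wilcoxon statistic for the desired level.
    """
    diffs = sorted(y - x for x in xs for y in ys)
    mn = len(diffs)
    if not 1 <= k <= mn + 1 - k:
        raise ValueError("need 1 <= k <= (mn + 1) / 2")
    return diffs[k - 1], diffs[mn - k]   # 1-based D_(k) and D_(mn+1-k)


xs = [1.0, 2.0, 4.0]          # hypothetical X-sample
ys = [3.0, 5.0, 8.0]          # hypothetical Y-sample
lo, hi = shift_interval(xs, ys, k=2)
shifted = shift_interval(xs, [y + 2.5 for y in ys], k=2)
```

Adding a constant to every y moves both end points by that constant, which is the
equivariance under shifts of the second sample that one expects of an interval for
the shift parameter.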


6.12 Average Smallest Equivariant Confidence Sets 

In the examples considered so far, the invariance and equivariance properties of 
the confidence sets corresponded to invariant properties of the associated tests. 
In the following examples this is no longer the case. 


Example 6.12.1 Let $X_1, \ldots, X_n$ be a sample from $N(\xi, \sigma^2)$, and consider the
problem of estimating $\sigma^2$.

The model is invariant under translations $X'_i = X_i + a$, and sufficiency and
invariance reduce the data to $S^2 = \sum (X_i - \bar{X})^2$. The problem of estimating $\sigma^2$
by confidence sets also remains invariant under scale changes $X'_i = bX_i$, $S' = bS$,
$\sigma' = b\sigma$ ($0 < b$), although these do not leave the corresponding problem of
testing the hypothesis $\sigma = \sigma_0$ invariant. (Instead, they leave invariant the family
of these testing problems, in the sense that they transform one such hypothesis
into another.) The totality of equivariant confidence sets based on $S$ is given by

$$\frac{\sigma^2}{S^2} \in A, \tag{6.38}$$

where $A$ is any fixed set on the line satisfying

$$P_{\sigma=1}\left(\frac{1}{S^2} \in A\right) = 1 - \alpha. \tag{6.39}$$

That any set $\sigma^2 \in S^2 \cdot A$ is equivariant is obvious. Conversely, suppose that
$\sigma^2 \in C(S^2)$ is an equivariant family of confidence sets for $\sigma^2$. Then $C(S^2)$ must
satisfy $b^2 C(S^2) = C(b^2 S^2)$ and hence

$$\sigma^2 \in C(S^2) \quad \text{if and only if} \quad \frac{\sigma^2}{S^2} \in \frac{1}{S^2} C(S^2) = C(1),$$

which establishes (6.38) with $A = C(1)$.

Among the confidence sets (6.38) with $A$ satisfying (6.39) there does not exist
one that uniformly minimizes the probability of covering false values (Problem
6.73). Consider instead the problem of determining the confidence sets that are
physically smallest in the sense of having minimum Lebesgue measure. This
requires minimizing $\int_A dv$ subject to (6.39). It follows from the Neyman-Pearson
lemma that the minimizing $A^*$ is

$$A^* = \{v : p(v) > C\}, \tag{6.40}$$

where $p(v)$ is the density of $V = 1/S^2$ when $\sigma = 1$, and where $C$ is determined
by (6.39). Since $p(v)$ is unimodal (Problem 6.74), these smallest confidence sets
are intervals, $aS^2 < \sigma^2 < bS^2$. Values of $a$ and $b$ are tabled by Tate and Klett
(1959), who also table the corresponding (different) values $a'$, $b'$ for the uniformly
most accurate unbiased confidence intervals $a'S^2 < \sigma^2 < b'S^2$ (given in Example
5.5.1).

Instead of minimizing the Lebesgue measure $\int_A dv$ of the confidence sets $A$,
one may prefer to minimize the scale-invariant measure

$$\int_A \frac{1}{v}\, dv. \tag{6.41}$$

To an interval $(a, b)$, (6.41) assigns, in place of its length $b - a$, its logarithmic
length $\log b - \log a = \log(b/a)$. The optimum solution $A^{**}$ with respect to this
new measure is again obtained by applying the Neyman-Pearson lemma, and is
given by

$$A^{**} = \{v : v\,p(v) > C\}, \tag{6.42}$$

which coincides with the uniformly most accurate unbiased confidence sets
[Problem 6.75(i)].

One advantage of minimizing (6.41) instead of Lebesgue measure is that it
then does not matter whether one estimates $\sigma$ or $\sigma^2$ (or $\sigma^r$ for some other power
$r$), since under (6.41), if $(a, b)$ is the best interval for $\sigma$, then $(a^r, b^r)$ is the
best interval for $\sigma^r$ [Problem 6.75(ii)]. ■
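
The reason behind this invariance can be seen from a short computation, given here
as a sketch for positive powers $r$ and under the reading that (6.41) is the integral
of $1/v$ over the set, which is what the logarithmic length in the text indicates.

```latex
\int_{a}^{b} \frac{dv}{v} = \log b - \log a = \log\frac{b}{a},
\qquad
\int_{a^{r}}^{b^{r}} \frac{dw}{w} = \log\frac{b^{r}}{a^{r}} = r\,\log\frac{b}{a}
\quad (r > 0).
```

Passing from $\sigma$ to $\sigma^r$ therefore multiplies the measure of every interval by the same
constant $r$, so that the ordering of intervals by (6.41) is unchanged, while the
coverage probability is unaffected by the monotone reparametrization.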

Example 6.12.2 Let $X_i$ $(i = 1, \ldots, r)$ be independently normally distributed
as $N(\xi_i, 1)$. A slight generalization of Example 6.11.2 shows that uniformly most
accurate equivariant confidence sets for $(\xi_1, \ldots, \xi_r)$ exist with respect to the group
$G$ of all rigid transformations and are given by

$$\sum (X_i - \xi_i)^2 \le c. \qquad (6.43)$$

Suppose that the context of the problem does not possess the symmetry which
would justify invoking invariance with respect to $G$, but does allow the weaker
assumption of invariance under the group $G_0$ of translations $X_i' = X_i + a_i$. The
totality of equivariant confidence sets with respect to $G_0$ is given by

$$(X_1 - \xi_1, \ldots, X_r - \xi_r) \in A, \qquad (6.44)$$

where $A$ is any fixed set in $r$-space satisfying

$$P_{\xi_1 = \cdots = \xi_r = 0}\big((X_1, \ldots, X_r) \in A\big) = 1 - \alpha. \qquad (6.45)$$

Since uniformly most accurate equivariant confidence sets do not exist (Problem
6.73), let us consider instead the problem of determining the confidence
sets of smallest Lebesgue measure. (This measure is invariant under $G_0$.) This is
given by (6.40) with $v = (v_1, \ldots, v_r)$ and $p(v)$ the density of $(X_1, \ldots, X_r)$ when
$\xi_1 = \cdots = \xi_r = 0$, and hence coincides with (6.43).

Quite surprisingly, the confidence sets (6.43) are inadmissible if and only if
$r \ge 3$. A further discussion of this fact and references are deferred to Example
8.5.4. ■

Example 6.12.3 In the preceding example, suppose that the $X_i$ are distributed
as $N(\xi_i, \sigma^2)$ with $\sigma^2$ unknown, and that a variable $S^2$ is available for estimating
$\sigma^2$. Of $S^2$ assume that it is independent of the $X$'s and that $S^2/\sigma^2$ has a
$\chi^2$-distribution with $f$ degrees of freedom.

The estimation of $(\xi_1, \ldots, \xi_r)$ by confidence sets on the basis of $X$'s and $S^2$
remains invariant under the group $G_0$ of transformations

$$X_i' = bX_i + a_i, \quad S' = bS, \quad \xi_i' = b\xi_i + a_i, \quad \sigma' = b\sigma,$$


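and the most general equivariant confidence set is of the form

$$\left(\frac{X_1 - \xi_1}{S}, \ldots, \frac{X_r - \xi_r}{S}\right) \in A, \qquad (6.46)$$

where $A$ is any fixed set in $r$-space satisfying

$$P_{\xi_1 = \cdots = \xi_r = 0}\left(\left(\frac{X_1}{S}, \ldots, \frac{X_r}{S}\right) \in A\right) = 1 - \alpha. \qquad (6.47)$$

The confidence sets (6.46) can be written as

$$(\xi_1, \ldots, \xi_r) \in (X_1, \ldots, X_r) - SA, \qquad (6.48)$$

where $-SA$ is the set obtained by multiplying each point of $A$ by the scalar $-S$.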

To see (6.48), suppose that $C(X_1, \ldots, X_r; S)$ is an equivariant confidence set
for $(\xi_1, \ldots, \xi_r)$. Then the $r$-dimensional set $C$ must satisfy

$$C(bX_1 + a_1, \ldots, bX_r + a_r; bS) = b[C(X_1, \ldots, X_r; S)] + (a_1, \ldots, a_r)$$

for all $a_1, \ldots, a_r$ and all $b > 0$. It follows that $(\xi_1, \ldots, \xi_r) \in C$ if and only if

$$\left(\frac{X_1 - \xi_1}{S}, \ldots, \frac{X_r - \xi_r}{S}\right) \in \frac{(X_1, \ldots, X_r) - C(X_1, \ldots, X_r; S)}{S} = -C(0, \ldots, 0; 1) = A.$$

The equivariant confidence sets of smallest volume are obtained by choosing for
$A$ the set $A^*$ given by (6.40) with $v = (v_1, \ldots, v_r)$ and $p(v)$ the joint density of
$(X_1/S, \ldots, X_r/S)$ when $\xi_1 = \cdots = \xi_r = 0$. This density is a decreasing function
of $\sum v_i^2$ (Problem 6.76), and the smallest equivariant confidence sets are therefore
given by

$$\sum (X_i - \xi_i)^2 \le cS^2. \qquad (6.49)$$

[Under the larger group $G$ generated by all rigid transformations of $(X_1, \ldots, X_r)$
together with the scale changes $X_i' = bX_i$, $S' = bS$, the same sets have the
stronger property of being uniformly most accurate equivariant; see Problem
6.77.] ■


Examples 6.12.1-6.12.3 have the common feature that the equivariant confidence
sets $S(X)$ for $\theta = (\theta_1, \ldots, \theta_r)$ are characterized by an $r$-valued pivotal
quantity, that is, a function $h(X, \theta) = (h_1(X, \theta), \ldots, h_r(X, \theta))$ of the observations
$X$ and parameters $\theta$ being estimated that has a fixed distribution, and such
that the most general equivariant confidence sets are of the form

$$h(X, \theta) \in A \qquad (6.50)$$

for some fixed set $A$.⁶ When the functions $h_i$ are linear in $\theta$, the confidence sets
$C(X)$ obtained by solving (6.50) for $\theta$ are linear transforms of $A$ (with random
coefficients), so that the volume or invariant measure of $C(X)$ is minimized by
minimizing

$$\int_A \rho(v_1, \ldots, v_r)\, dv_1 \ldots dv_r \qquad (6.51)$$

for the appropriate $\rho$. The problem thus reduces to that of minimizing (6.51)
subject to

$$P_{\theta_0}\{h(X, \theta_0) \in A\} = \int_A p(v_1, \ldots, v_r)\, dv_1 \ldots dv_r = 1 - \alpha, \qquad (6.52)$$

where $p(v_1, \ldots, v_r)$ is the density of the pivotal quantity $h(X, \theta)$. The minimizing
$A$ is given by

$$A^* = \left\{v : \frac{\rho(v_1, \ldots, v_r)}{p(v_1, \ldots, v_r)} < C\right\}, \qquad (6.53)$$

with $C$ determined by (6.52).
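
As a quick consistency check, reading (6.53) as $A^* = \{v : \rho(v)/p(v) < C\}$ with $\rho$ the weight function in (6.51), the two measures used in Example 6.12.1 can be substituted directly. The constant $C'$ below is introduced only for this illustration.

```latex
% Lebesgue measure: \rho(v) = 1
\frac{1}{p(v)} < C
  \iff p(v) > C', \qquad C' = 1/C
  \quad \text{(the form of (6.40))}

% Scale-invariant measure: \rho(v) = 1/v, \; v > 0
\frac{1}{v\,p(v)} < C
  \iff v\,p(v) > C', \qquad C' = 1/C
  \quad \text{(the form of (6.42))}
```

In both cases the constant is fixed afterwards by the coverage condition (6.52).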

The following is one more illustration of this approach. 


Example 6.12.4 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be samples from $N(\xi, \sigma^2)$ and
$N(\eta, \tau^2)$ respectively, and consider the problem of estimating $\Delta = \tau^2/\sigma^2$. Sufficiency
and invariance under translations $X_i' = X_i + a_1$, $Y_j' = Y_j + a_2$ reduce the
data to $S_X^2 = \sum (X_i - \bar{X})^2$ and $S_Y^2 = \sum (Y_j - \bar{Y})^2$. The problem of estimating $\Delta$
also remains invariant under the scale changes

$$X_i' = b_1 X_i, \quad Y_j' = b_2 Y_j, \quad 0 < b_1, b_2 < \infty,$$

which induce the transformations

$$S_X' = b_1 S_X, \quad S_Y' = b_2 S_Y, \quad \sigma' = b_1 \sigma, \quad \tau' = b_2 \tau. \qquad (6.54)$$

The totality of equivariant confidence sets for $\Delta$ is given by $\Delta/V \in A$, where
$V = S_Y^2/S_X^2$ and $A$ is any fixed set on the line satisfying

$$P_{\Delta = 1}\left(\frac{1}{V} \in A\right) = 1 - \alpha. \qquad (6.55)$$

To see this, suppose that $C(S_X, S_Y)$ are any equivariant confidence sets for $\Delta$.
Then $C$ must satisfy

$$C(b_1 S_X, b_2 S_Y) = \frac{b_2^2}{b_1^2}\, C(S_X, S_Y), \qquad (6.56)$$

and hence $\Delta \in C(S_X, S_Y)$ if and only if the pivotal quantity $V/\Delta$ satisfies

$$\frac{\Delta}{V} = \frac{S_X^2}{S_Y^2}\,\Delta \in \frac{S_X^2}{S_Y^2}\, C(S_X, S_Y) = C(1, 1) = A.$$

As in Example 6.12.1, one may now wish to choose $A$ so as to minimize either
its Lebesgue measure $\int_A dv$ or the invariant measure $\int_A (1/v)\,dv$. The resulting
confidence sets are of the form

$$p(v) > C \quad \text{and} \quad vp(v) > C \qquad (6.57)$$

respectively. In both cases, they are intervals $V/b < \Delta < V/a$ [Problem 6.78(i)].
The values of $a$ and $b$ minimizing Lebesgue measure are tabled by Levy and
Narula (1974); those for the invariant measure coincide with the uniformly most
accurate unbiased intervals [Problem 6.78(ii)]. ■


⁶ More general results concerning the relationship of equivariant confidence sets and
pivotal quantities are given in Problems 6.69-6.72.


6.13 Confidence Bands for a Distribution Function 

Suppose that $X = (X_1, \ldots, X_n)$ is a sample from an unknown continuous cumulative
distribution function $F$, and that lower and upper bounds $L_X$ and $M_X$ are
to be determined such that with preassigned probability $1 - \alpha$ the inequalities

$$L_X(u) < F(u) < M_X(u) \quad \text{for all } u$$

hold for all continuous cumulative distribution functions $F$. This problem is
invariant under the group $G$ of transformations

$$X_i' = g(X_i), \quad i = 1, \ldots, n,$$

where $g$ is any continuous strictly increasing function. The induced transformation
in the parameter space is $\bar{g}F = F(g^{-1})$.

If $S(x)$ is the set of continuous cumulative distribution functions

$$S(x) = \{F : L_x(u) < F(u) < M_x(u) \text{ for all } u\},$$

then

$$g^*S(x) = \{\bar{g}F : L_x(u) < F(u) < M_x(u) \text{ for all } u\}$$
$$= \{F : L_x[g^{-1}(u)] < F(u) < M_x[g^{-1}(u)] \text{ for all } u\}.$$

For an equivariant procedure, this must coincide with the set

$$S(gx) = \{F : L_{g(x_1), \ldots, g(x_n)}(u) < F(u) < M_{g(x_1), \ldots, g(x_n)}(u) \text{ for all } u\}.$$

The condition of equivariance is therefore

$$L_{g(x_1), \ldots, g(x_n)}[g(u)] = L_x(u),$$
$$M_{g(x_1), \ldots, g(x_n)}[g(u)] = M_x(u) \quad \text{for all } x \text{ and } u.$$

To characterize the totality of equivariant procedures, consider the empirical
distribution function (EDF) $T_x$ given by

$$T_x(u) = \frac{i}{n} \quad \text{for } x_{(i)} \le u < x_{(i+1)}, \quad i = 0, \ldots, n,$$

where $x_{(1)} < \cdots < x_{(n)}$ is the ordered sample and where $x_{(0)} = -\infty$, $x_{(n+1)} = \infty$.
Then a necessary and sufficient condition for $L$ and $M$ to satisfy the above
equivariance condition is the existence of numbers $a_0, \ldots, a_n$; $a_0', \ldots, a_n'$ such
that

$$L_x(u) = a_i, \quad M_x(u) = a_i' \quad \text{for } x_{(i)} < u < x_{(i+1)}.$$

That this condition is sufficient is immediate. To see that it is also necessary, let
$u$, $u'$ be any two points satisfying $x_{(i)} < u < u' < x_{(i+1)}$. Given any $y_1, \ldots, y_n$
and $v$ with $y_{(i)} < v < y_{(i+1)}$, there exist $g, g' \in G$ such that

$$g(y_{(i)}) = g'(y_{(i)}) = x_{(i)}, \quad g(v) = u, \quad g'(v) = u'.$$

If $L_x$, $M_x$ are equivariant, it then follows that $L_x(u') = L_y(v)$ and $L_x(u) =
L_y(v)$, and hence that $L_x(u') = L_x(u)$ and similarly $M_x(u') = M_x(u)$, as was to
be proved. This characterization shows $L_x$ and $M_x$ to be step functions whose
discontinuity points are restricted to those of $T_x$.

Since any two continuous strictly increasing cumulative distribution functions 
can be transformed into one another through a transformation g, it follows that all 
these distributions have the same probability of being covered by an equivariant 
confidence band. (See Problem 6.84.) Suppose now that F is continuous but 
no longer strictly increasing. If I is any interval of constancy of F, there are 
no observations in I, so that I is also an interval of constancy of the sample 
cumulative distribution function. It follows that the probability of the confidence 
band covering F is not affected by the presence of I and hence is the same for 
all continuous cumulative distribution functions F. 

For any numbers $a_i$, $a_i'$ let $\Delta_i$, $\Delta_i'$ be determined by

$$a_i = \frac{i}{n} - \Delta_i, \quad a_i' = \frac{i}{n} + \Delta_i'.$$

Then it was seen above that any numbers $\Delta_0, \ldots, \Delta_n$; $\Delta_0', \ldots, \Delta_n'$ define a confidence
band for $F$, which is equivariant and hence has constant probability of
covering the true $F$. From these confidence bands a test can be obtained of the
hypothesis of goodness of fit $F = F_0$ that the unknown $F$ equals a hypothetical
distribution $F_0$. The hypothesis is accepted if $F_0$ lies entirely within the band,
that is, if

$$-\Delta_i < F_0(u) - T_x(u) < \Delta_i'$$

for all $x_{(i)} < u < x_{(i+1)}$ and all $i = 1, \ldots, n$.

Within this class of tests there exists no UMP member, and the most common
choice of the $\Delta$'s is $\Delta_i = \Delta_i' = \Delta$ for all $i$. The acceptance region of the resulting
Kolmogorov-Smirnov test can be written as

$$\sup_{-\infty < u < \infty} |F_0(u) - T_x(u)| < \Delta. \qquad (6.58)$$

Tables of the null distribution of the Kolmogorov-Smirnov statistic are given
by Birnbaum (1952). For large $n$, approximate critical values can be obtained
from the limit distribution $K$ of $\sqrt{n}\,\sup |F_0(u) - T_x(u)|$, due to Kolmogorov and
tabled by Smirnov (1948). Derivations of $K$ can be found, for example, in Feller
(1948), Billingsley (1968), and Hájek, Šidák and Sen (1999). The large sample
properties of this test will be studied in Example 11.2.12 and Section 14.2. The
more general problem of testing goodness-of-fit will be presented in Chapter 14.
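
The statistic in (6.58) and its invariance under monotone transformations can be illustrated numerically. The following is a minimal sketch, not a reference implementation: the function names, the sample values, the choice of an Exp(1) hypothesis, and the transformation $g(x) = \log x$ are all assumptions made for the illustration.

```python
import math

def ks_statistic(sample, cdf):
    """sup_u |F0(u) - T_x(u)| for a continuous hypothesized cdf F0."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The EDF jumps from (i-1)/n to i/n at the i-th order statistic,
        # so the supremum is approached just before, or attained at, a jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def exp_cdf(u):
    """Hypothesized F0: the Exp(1) distribution function."""
    return 1.0 - math.exp(-u) if u > 0 else 0.0

def transformed_cdf(u):
    """F0(g^{-1}(u)) for g(x) = log(x), i.e. the cdf of log(X), X ~ Exp(1)."""
    return exp_cdf(math.exp(u))

sample = [0.11, 0.52, 0.74, 1.30, 2.05, 3.40]  # illustrative data
d_original = ks_statistic(sample, exp_cdf)
d_transformed = ks_statistic([math.log(x) for x in sample], transformed_cdf)
```

Up to floating-point rounding, `d_original` and `d_transformed` agree, which is the numerical counterpart of the equivariance condition: transforming the data by $g$ and the hypothesis by $\bar{g}$ leaves the statistic unchanged.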
6.14 Problems

Section 6.1

Problem 6.1 Let $G$ be a group of measurable transformations of $(\mathcal{X}, \mathcal{A})$ leaving
$\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ invariant, and let $T(x)$ be a measurable transformation to $(\mathcal{T}, \mathcal{B})$.
Suppose that $T(x_1) = T(x_2)$ implies $T(gx_1) = T(gx_2)$ for all $g \in G$, so that $G$
induces a group $G^*$ on $\mathcal{T}$ through $g^*T(x) = T(gx)$, and suppose further that
the induced transformations $g^*$ are measurable $\mathcal{B}$. Then $G^*$ leaves the family
$\mathcal{P}^T = \{P_\theta^T, \theta \in \Omega\}$ of distributions of $T$ invariant.

Section 6.2 

Problem 6.2 (i) Let $\mathcal{X}$ be the totality of points $x = (x_1, \ldots, x_n)$ for which
all coordinates are different from zero, and let $G$ be the group of transformations
$x_i' = cx_i$, $c > 0$. Then a maximal invariant under $G$ is
$(\operatorname{sgn} x_n, x_1/x_n, \ldots, x_{n-1}/x_n)$ where $\operatorname{sgn} x$ is $1$ or $-1$ as $x$ is positive or
negative.

(ii) Let $\mathcal{X}$ be the space of points $x = (x_1, \ldots, x_n)$ for which all coordinates
are distinct, and let $G$ be the group of all transformations $x_i' = f(x_i)$, $i =
1, \ldots, n$, such that $f$ is a 1 : 1 transformation of the real line onto itself
with at most a finite number of discontinuities. Then $G$ is transitive over
$\mathcal{X}$.

[(ii): Let $x = (x_1, \ldots, x_n)$ and $x' = (x_1', \ldots, x_n')$ be any two points of $\mathcal{X}$. Let
$I_1, \ldots, I_n$ be a set of mutually exclusive open intervals which (together with
their end points) cover the real line and such that $x_j \in I_j$. Let $I_1', \ldots, I_n'$ be a
corresponding set of intervals for $x_1', \ldots, x_n'$. Then there exists a transformation
$f$ which maps each $I_j$ continuously onto $I_j'$, maps $x_j$ into $x_j'$, and maps the set
of $n - 1$ end points of $I_1, \ldots, I_n$ onto the set of end points of $I_1', \ldots, I_n'$.]

Problem 6.3 Suppose $M$ is any $m \times p$ matrix. Show that $M^T M$ is positive
semidefinite. Also, show the rank of $M^T M$ equals the rank of $M$, so that in
particular $M^T M$ is nonsingular if and only if $m \ge p$ and $M$ is of rank $p$.
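
The two claims of Problem 6.3 can be spot-checked on small integer matrices. The sketch below uses exact rational arithmetic from the standard library; the helper names and the two example matrices are assumptions chosen for illustration, and a numerical check of this kind is of course no substitute for the proof.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def rank(a):
    """Rank by Gaussian elimination over exact rationals."""
    rows = [[Fraction(v) for v in row] for row in a]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def gram(m):
    """The p x p matrix M^T M."""
    return matmul(transpose(m), m)

def quadratic_form(g, x):
    """x^T G x, which equals |Mx|^2 when G = M^T M."""
    k = len(x)
    return sum(x[i] * g[i][j] * x[j] for i in range(k) for j in range(k))

tall = [[1, 2], [3, 4], [5, 6]]   # m = 3 >= p = 2, rank 2: M^T M nonsingular
wide = [[1, 2, 3], [4, 5, 6]]     # m = 2 <  p = 3: M^T M is 3 x 3 but singular
```

For `wide`, the vector `[1, -2, 1]` lies in the null space of $M$, so its quadratic form vanishes; this is the degenerate direction that makes $M^T M$ singular when $m < p$.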

Problem 6.4 (i) A sufficient condition for (6.8) to hold is that $D$ is a normal
subgroup of $G$.

(ii) If $G$ is the group of transformations $x' = ax + b$, $a \ne 0$, $-\infty < b < \infty$, then
the subgroup of translations $x' = x + b$ is normal but the subgroup $x' = ax$
is not.

[The defining property of a normal subgroup is that given $d \in D$, $g \in G$, there
exists $d' \in D$ such that $gd = d'g$. The equality $s(x_1) = s(x_2)$ implies $x_2 = dx_1$
for some $d \in D$, and hence $ex_2 = edx_1 = d'ex_1$. The result (i) now follows, since
$s$ is invariant under $D$.]


Section 6.3 

Problem 6.5 Prove statements (i)-(iii) of Example 6.3.1. 
Problem 6.6 Prove Theorem 6.3.1

(i) by analogy with Example 6.3.1, and

(ii) by the method of Example 6.3.2. [Hint: A maximal invariant under $G$ is the
set $\{g_1 x, \ldots, g_N x\}$.]


Problem 6.7 Consider the situation of Example 6.3.1 with $n = 1$, and suppose
that $f$ is strictly increasing on $(0, 1)$.

(i) The likelihood ratio test rejects if $X < \alpha/2$ or $X > 1 - \alpha/2$.

(ii) The MP invariant test agrees with the likelihood ratio test when $f$ is convex.

(iii) When $f$ is concave, the MP invariant test rejects when

$$\frac{1}{2} - \frac{\alpha}{2} < X < \frac{1}{2} + \frac{\alpha}{2},$$

and the likelihood ratio test is the least powerful invariant test against both
alternatives and has power $< \alpha$.


Problem 6.8 Let $X$, $Y$ have the joint probability density $f(x, y)$. Then the integral
$h(z) = \int_{-\infty}^{\infty} f(y - z, y)\,dy$ is finite for almost all $z$, and is the probability
density of $Z = Y - X$.

[Since $P\{Z \le b\} = \int_{-\infty}^{b} h(z)\,dz$, it is finite and hence $h$ is finite almost
everywhere.]


Problem 6.9 (i) Let $X = (X_1, \ldots, X_n)$ have probability density $(1/\theta^n) f[(x_1 -
\xi)/\theta, \ldots, (x_n - \xi)/\theta]$, where $-\infty < \xi < \infty$, $0 < \theta$ are unknown, and where
$f$ is even. The problem of testing $f = f_0$ against $f = f_1$ remains invariant
under the transformations $x_i' = ax_i + b$ $(i = 1, \ldots, n)$, $a \ne 0$, $-\infty < b < \infty$
and the most powerful invariant test is given by the rejection region

$$\int_{-\infty}^{\infty} \int_0^{\infty} v^{n-2} f_1(vx_1 + u, \ldots, vx_n + u)\, dv\, du
> C \int_{-\infty}^{\infty} \int_0^{\infty} v^{n-2} f_0(vx_1 + u, \ldots, vx_n + u)\, dv\, du.$$

(ii) Let $X = (X_1, \ldots, X_n)$ have probability density $f(x_1 - \sum_{j=1}^k w_{1j}\beta_j, \ldots, x_n -
\sum_{j=1}^k w_{nj}\beta_j)$ where $k < n$, the $w$'s are given constants, the matrix
$(w_{ij})$ is of rank $k$, the $\beta$'s are unknown, and we wish to test $f = f_0$
against $f = f_1$. The problem remains invariant under the transformations
$x_i' = x_i + \sum_{j=1}^k w_{ij}\gamma_j$, $-\infty < \gamma_1, \ldots, \gamma_k < \infty$, and the most powerful
invariant test is given by the rejection region

$$\frac{\int \cdots \int f_1(x_1 - \sum w_{1j}\beta_j, \ldots, x_n - \sum w_{nj}\beta_j)\, d\beta_1 \ldots d\beta_k}
{\int \cdots \int f_0(x_1 - \sum w_{1j}\beta_j, \ldots, x_n - \sum w_{nj}\beta_j)\, d\beta_1 \ldots d\beta_k} > C.$$

[A maximal invariant is given by

$$y = \left(x_1 - \sum_{r=n-k+1}^{n} a_{1r} x_r,\; x_2 - \sum_{r=n-k+1}^{n} a_{2r} x_r,\; \ldots,\; x_{n-k} - \sum_{r=n-k+1}^{n} a_{n-k,r} x_r\right)$$

for suitably chosen constants $a_{ir}$.]
Problem 6.10 Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be samples from exponential distributions
with densities $\sigma^{-1} e^{-(x - \xi)/\sigma}$ for $x \ge \xi$, and $\tau^{-1} e^{-(y - \eta)/\tau}$ for
$y \ge \eta$.

(i) For testing $\tau/\sigma \le \Delta$ against $\tau/\sigma > \Delta$, there exists a UMP invariant test
with respect to the group $G$: $X_i' = aX_i + b$, $Y_j' = aY_j + c$, $a > 0$, $-\infty <
b, c < \infty$, and its rejection region is

$$\frac{\sum [y_j - \min(y_1, \ldots, y_n)]}{\sum [x_i - \min(x_1, \ldots, x_m)]} > C.$$

(ii) This test is also UMP unbiased.

(iii) Extend these results to the case that only the $r$ smallest $X$'s and the $s$
smallest $Y$'s are observed.

[(ii): See Problem 5.15.]


Problem 6.11 If $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ are samples from $N(\xi, \sigma^2)$ and
$N(\eta, \tau^2)$ respectively, the problem of testing $\tau^2 = \sigma^2$ against the two-sided
alternatives $\tau^2 \ne \sigma^2$ remains invariant under the group $G$ generated by the
transformations $X_i' = aX_i + b$, $Y_i' = aY_i + c$, $(a \ne 0)$, and $X_i' = Y_i$, $Y_i' = X_i$.
There exists a UMP invariant test under $G$ with rejection region

$$W = \max\left\{\frac{\sum (Y_i - \bar{Y})^2}{\sum (X_i - \bar{X})^2},\; \frac{\sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2}\right\} \ge k.$$

[The ratio of the probability densities of $W$ for $\tau^2/\sigma^2 = \Delta$ and $\tau^2/\sigma^2 = 1$ is
proportional to $[(1 + w)/(\Delta + w)]^{n-1} + [(1 + w)/(1 + \Delta w)]^{n-1}$ for $w \ge 1$. The
derivative of this expression is $\ge 0$ for all $\Delta$.]


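Problem 6.12 Let $X_1, \ldots, X_n$ be a sample from a distribution with density

$$\frac{1}{\tau^n}\, f\!\left(\frac{x_1}{\tau}\right) \cdots f\!\left(\frac{x_n}{\tau}\right),$$

where $f(x)$ is either zero for $x < 0$ or symmetric about zero. The most powerful
scale-invariant test for testing $H : f = f_0$ against $K : f = f_1$ rejects when

$$\frac{\int_0^{\infty} v^{n-1} f_1(vx_1) \cdots f_1(vx_n)\, dv}{\int_0^{\infty} v^{n-1} f_0(vx_1) \cdots f_0(vx_n)\, dv} > C.$$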


Problem 6.13 Normal vs. double exponential. For $f_0(x) = e^{-x^2/2}/\sqrt{2\pi}$,
$f_1(x) = e^{-|x|}/2$, the test of the preceding problem reduces to rejecting when

$$\sqrt{\sum x_i^2} \Big/ \sum |x_i| < C.$$

(Hogg, 1972.)

Note. The corresponding test when both location and scale are unknown 
is obtained in Uthoff (1973). Testing normality against Cauchy alternatives is 
discussed by Franck (1981). 


Problem 6.14 Uniform vs. triangular.

(i) For $f_0(x) = 1$ $(0 < x < 1)$, $f_1(x) = 2x$ $(0 < x < 1)$, the test of Problem
6.12 reduces to rejecting when $T = x_{(n)}/\tilde{x} < C$, where $\tilde{x} = (x_1 \cdots x_n)^{1/n}$
denotes the geometric mean.

(ii) Under $f_0$, the statistic $2n \log T$ is distributed as $\chi^2_{2n-2}$.

(Quesenberry and Starbuck, 1976.)
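
Part (ii) of Problem 6.14 lends itself to a quick Monte Carlo sanity check. The sketch below assumes that $T$ is the ratio of the largest observation to the geometric mean, so that $2n \log T = 2\sum_i \log(x_{(n)}/x_i)$, and it only compares the simulated mean with $2n - 2$, the mean of a $\chi^2_{2n-2}$ variable. The function names, the seed, and the replication count are illustrative choices.

```python
import math
import random

def statistic(sample):
    """2 n log T, with T = (largest observation) / (geometric mean)."""
    largest = max(sample)
    return 2.0 * sum(math.log(largest / x) for x in sample)

def simulate_mean(n, replications, seed=0):
    """Average of 2 n log T over uniform(0, 1) samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        # 1 - random() lies in (0, 1], which avoids taking log of zero.
        sample = [1.0 - rng.random() for _ in range(n)]
        total += statistic(sample)
    return total / replications
```

With `n = 5` the simulated mean should be close to 8. The statistic is also unchanged when every observation is multiplied by the same positive constant, as scale invariance requires.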

Problem 6.15 Show that the test of Problem 6.9(i) reduces to

(i) $[x_{(n)} - x_{(1)}]/S < c$ for normal vs. uniform;

(ii) $[\bar{x} - x_{(1)}]/S < c$ for normal vs. exponential;

(iii) $[\bar{x} - x_{(1)}]/[x_{(n)} - x_{(1)}] < c$ for uniform vs. exponential.

(Uthoff, 1970.)

Note. When testing for normality, one is typically not interested in distinguishing
the normal from some other given shape but would like to know more
generally whether the data are or are not consonant with a normal distribution.
This is a special case of the problem of testing for goodness of fit, which is briefly
discussed at the end of Section 6.13 and forms the topic of Chapter 14; also, see
the many references in the notes to Chapter 14.

Problem 6.16 Let $X_1, \ldots, X_n$ be independent and normally distributed. Suppose
$X_i$ has mean $\mu_i$ and variance $\sigma^2$ (which is the same for all $i$). Consider
testing the null hypothesis that $\mu_i = 0$ for all $i$. Using invariance considerations,
find a UMP invariant test with respect to a suitable group of transformations in
each of the following cases:

(i) $\sigma^2$ is known and equal to one.

(ii) $\sigma^2$ is unknown.


Section 6.4 

Problem 6.17 (i) When testing $H : p \le p_0$ against $K : p > p_0$ by means
of the test corresponding to (6.13), determine the sample size required to
obtain power $\beta$ against $p = p_1$, $\alpha = .05$, $\beta = .9$ for the cases $p_0 = .1$,
$p_1 = .15, .20, .25$; $p_0 = .05$, $p_1 = .10, .15, .20, .25$; $p_0 = .01$, $p_1 = .02, .05,
.10, .15, .20$.

(ii) Compare this with the sample size required if the inspection is by attributes
and the test is based on the total number of defectives.

Problem 6.18 Two-sided t-test.

(i) Let $X_1, \ldots, X_n$ be a sample from $N(\xi, \sigma^2)$. For testing $\xi = 0$ against $\xi \ne 0$,
there exists a UMP invariant test with respect to the group $X_i' = cX_i$,
$c \ne 0$, given by the two-sided t-test (5.17).

(ii) Let $X_1, \ldots, X_m$, and $Y_1, \ldots, Y_n$ be samples from $N(\xi, \sigma^2)$ and $N(\eta, \sigma^2)$
respectively. For testing $\eta = \xi$ against $\eta \ne \xi$ there exists a UMP invariant
test with respect to the group $X_i' = aX_i + b$, $Y_j' = aY_j + b$, $a \ne 0$, given by
the two-sided t-test (5.30).

[(i): Sufficiency and invariance reduce the problem to $|t|$, which in the notation
of Section 4 has the probability density $p_\delta(t) + p_\delta(-t)$ for $t > 0$. The ratio of
this density for $\delta = \delta_1$ to its value for $\delta = 0$ is proportional to $\int_0^\infty (e^{\delta_1 v} +
e^{-\delta_1 v})\, g_{t^2}(v)\, dv$, which is an increasing function of $t^2$ and hence of $|t|$.]

Problem 6.19 Testing a correlation coefficient. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a
sample from a bivariate normal distribution.

(i) For testing $\rho \le \rho_0$ against $\rho > \rho_0$ there exists a UMP invariant test with
respect to the group of all transformations $X_i' = aX_i + b$, $Y_i' = cY_i + d$ for
which $a, c > 0$. This test rejects when the sample correlation coefficient $R$
is too large.

(ii) The problem of testing $\rho = 0$ against $\rho \ne 0$ remains invariant in addition
under the transformation $Y_i' = -Y_i$, $X_i' = X_i$. With respect to the
group generated by this transformation and those of (i) there exists a UMP
invariant test, with rejection region $|R| \ge C$.

[(i): To show that the probability density $p_\rho(r)$ of $R$ has monotone likelihood
ratio, apply the condition of Problem 3.27(i), to the expression (5.87) given for
this density. Putting $t = \rho r + 1$, the second derivative $\partial^2 \log p_\rho(r)/\partial\rho\,\partial r$ up to a
positive factor is

$$\frac{\sum_{i,j=0}^{\infty} c_i c_j t^{i+j-2} \left[(j - i)^2 (t - 1) + (i + j)\right]}{2\left[\sum_{i=0}^{\infty} c_i t^i\right]^2}.$$

To see that the numerator is positive for all $t > 0$, note that it is greater than

$$2 \sum_{i=0}^{\infty} c_i t^{i-2} \sum_{j=i+1}^{\infty} c_j t^j \left[(j - i)^2 (t - 1) + (i + j)\right].$$

Holding $i$ fixed and using the inequality $c_{j+1} < \frac{1}{2} c_j$, the coefficient of $t^j$ in the
interior sum is $\ge 0$.]

Problem 6.20 For testing the hypothesis that the correlation coefficient $\rho$ of a
bivariate normal distribution is $\le \rho_0$, determine the power against the alternative
$\rho = \rho_1$, when the level of significance $\alpha$ is $.05$, $\rho_0 = .3$, $\rho_1 = .5$, and the sample
size $n$ is 50, 100, 200.


Section 6.5 

Problem 6.21 Almost invariance of a test $\phi$ with respect to the group $G$ of either
Problem 6.10(i) or Example 6.3.4 implies that $\phi$ is equivalent to an invariant
test.

Problem 6.22 The totality of permutations of $K$ distinct numbers $a_1, \ldots, a_K$,
for varying $a_1, \ldots, a_K$ can be represented as a subset $C_K$ of Euclidean $K$-space
$R_K$, and the group $G$ of Example 6.5.1 as the union of $C_2, C_3, \ldots$. Let $\nu$ be the
measure over $G$ which assigns to a subset $B$ of $G$ the value $\sum_{K=2}^{\infty} \mu_K(B \cap C_K)$,
where $\mu_K$ denotes Lebesgue measure in $E_K$. Give an example of a set $B \subset G$
and an element $g \in G$ such that $\nu(B) > 0$ but $\nu(Bg) = 0$.

[If $a, b, c, d$ are distinct numbers, the permutations $g$, $g'$ taking $(a, b)$ into $(b, a)$
and $(c, d)$ into $(d, c)$ respectively are points in $C_2$, but $gg'$ is a point in $C_4$.]


Section 6.6 

Problem 6.23 Show that 

(i) $G_1$ of Example 6.6.11 is a group;

(ii) the test which rejects when $X_{21}^2/X_{11}^2 > C$ is UMP invariant under $G_1$;

(iii) the smallest group containing $G_1$ and $G_2$ is the group $G$ of Example 6.6.11.

Problem 6.24 Consider a testing problem which is invariant under a group $G$
of transformations of the sample space, and let $\mathcal{C}$ be a class of tests which is
closed under $G$, so that $\phi \in \mathcal{C}$ implies $\phi g \in \mathcal{C}$, where $\phi g$ is the test defined by
$\phi g(x) = \phi(gx)$. If there exists an a.e. unique UMP member $\phi_0$ of $\mathcal{C}$, then $\phi_0$ is
almost invariant.


Problem 6.25 Envelope power function. Let $S(\alpha)$ be the class of all level-$\alpha$ tests
of a hypothesis $H$, and let $\beta^*(\theta)$ be the envelope power function, defined by

$$\beta^*(\theta) = \sup_{\phi \in S(\alpha)} \beta_\phi(\theta),$$

where $\beta_\phi$ denotes the power function of $\phi$. If the problem of testing $H$ is invariant
under a group $G$, then $\beta^*(\theta)$ is invariant under the induced group $\bar{G}$.


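Problem 6.26

(i) A generalization of equation (6.1) is

$$\int_A f(x)\, dP_\theta(x) = \int_{gA} f(g^{-1}x)\, dP_{\bar{g}\theta}(x).$$

(ii) If $P_{\theta_1}$ is absolutely continuous with respect to $P_{\theta_0}$, then $P_{\bar{g}\theta_1}$ is absolutely
continuous with respect to $P_{\bar{g}\theta_0}$ and

$$\frac{dP_{\theta_1}}{dP_{\theta_0}}(x) = \frac{dP_{\bar{g}\theta_1}}{dP_{\bar{g}\theta_0}}(gx) \quad (\text{a.e. } P_{\theta_0}).$$

(iii) The distribution of $dP_{\theta_1}/dP_{\theta_0}(X)$ when $X$ is distributed as $P_{\theta_0}$ is the same
as that of $dP_{\bar{g}\theta_1}/dP_{\bar{g}\theta_0}(X')$ when $X'$ is distributed as $P_{\bar{g}\theta_0}$.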


Problem 6.27 Invariance of likelihood ratio. Let the family of distributions $\mathcal{P} =
\{P_\theta, \theta \in \Omega\}$ be dominated by $\mu$, let $p_\theta = dP_\theta/d\mu$, let $\mu g^{-1}$ be the measure
defined by $\mu g^{-1}(A) = \mu[g^{-1}(A)]$, and suppose that $\mu$ is absolutely continuous
with respect to $\mu g^{-1}$ for all $g \in G$.

(i) Then

$$p_\theta(x) = p_{\bar{g}\theta}(gx)\, \frac{d\mu}{d\mu g^{-1}}(gx) \quad (\text{a.e. } \mu).$$

(ii) Let $\Omega$ and $\omega$ be invariant under $\bar{G}$, and countable. Then the likelihood ratio
$\sup_\Omega p_\theta(x) / \sup_\omega p_\theta(x)$ is almost invariant under $G$.

(iii) Suppose that $p_\theta(x)$ is continuous in $\theta$ for all $x$, that $\Omega$ is a separable pseudometric
space, and that $\Omega$ and $\omega$ are invariant. Then the likelihood ratio
is almost invariant under $G$.


Problem 6.28 Inadmissible likelihood-ratio test. In many applications in which
a UMP invariant test exists, it coincides with the likelihood-ratio test. That this
is, however, not always the case is seen from the following example. Let $P_1, \ldots, P_n$
be $n$ equidistant points on the circle $x^2 + y^2 = 4$, and $Q_1, \ldots, Q_n$ on the circle
$x^2 + y^2 = 1$. Denote the origin in the $(x, y)$ plane by $O$, let $0 < \alpha < \frac{1}{2}$ be fixed,
and let $(X, Y)$ be distributed over the $2n + 1$ points $P_1, \ldots, P_n$, $Q_1, \ldots, Q_n$, $O$
with probabilities given by the following table:

          P_i         Q_i               O
    H     α/n         (1 − 2α)/n        α
    K     p_i/n       0                 (n − 1)/n

where $\sum p_i = 1$. The problem remains invariant under rotations of the plane by
the angles $2k\pi/n$ $(k = 0, 1, \ldots, n - 1)$. The rejection region of the likelihood-ratio
test consists of the points $P_1, \ldots, P_n$, and its power is $1/n$. On the other hand,
the UMP invariant test rejects when $X = Y = 0$, and has power $(n - 1)/n$.


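Problem 6.29 Let $G$ be a group of transformations of $\mathcal{X}$, and let $\mathcal{A}$ be a $\sigma$-field
of subsets of $\mathcal{X}$, and $\mu$ a measure over $(\mathcal{X}, \mathcal{A})$. Then a set $A \in \mathcal{A}$ is said to be
almost invariant if its indicator function is almost invariant.


(i) The totality of almost invariant sets forms a $\sigma$-field $\mathcal{A}_0$, and a critical
function is almost invariant if and only if it is $\mathcal{A}_0$-measurable.

(ii) Let $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ be a dominated family of probability distributions
over $(\mathcal{X}, \mathcal{A})$, and suppose that $\bar{g}\theta = \theta$ for all $\bar{g} \in \bar{G}$, $\theta \in \Omega$. Then the $\sigma$-field
$\mathcal{A}_0$ of almost invariant sets is sufficient for $\mathcal{P}$.


[Let $\lambda = \sum c_i P_{\theta_i}$ be equivalent to $\mathcal{P}$. Then

$$\frac{dP_\theta}{d\lambda}(gx) = \frac{dP_{\bar{g}^{-1}\theta}}{\sum c_i\, dP_{\bar{g}^{-1}\theta_i}}(x) = \frac{dP_\theta}{d\lambda}(x) \quad (\text{a.e. } \lambda),$$

so that $dP_\theta/d\lambda$ is almost invariant and hence $\mathcal{A}_0$-measurable.]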


Problem 6.30 The UMP invariant test of Problem 6.13 is also UMP similar. 

[Consider the problem of testing a = 0 vs. a > 0 in the two-parameter 
exponential family with density 

(7(0,1-) exp ^ ^2 x i ~ ~ l Xi l) ’ 0 < a < 1.] 

Note. For the analogous result for the tests of Problem 6.14, 6.15, see 
Quesenberry and Starbuck (1976). 


Problem 6.31 The following UMP unbiased tests of Chapter 5 are also UMP
invariant under change in scale:

(i) The test of $g \le g_0$ in a gamma distribution (Problem 5.30).

(ii) The test of $b_1 \le b_2$ in Problem 5.18(i).


Section 6.7 

Problem 6.32 The definition of d-admissibility of a test coincides with the 
admissibility definition given in Section 1.8 when applied to a two-decision 
procedure with loss 0 or 1 as the decision taken is correct or false. 

Problem 6.33 (i) The following example shows that $\alpha$-admissibility does not
always imply d-admissibility. Let $X$ be distributed as $U(0, \theta)$, and consider
the tests $\varphi_1$ and $\varphi_2$ which reject when respectively $X < 1$ and $X < \frac{3}{2}$ for
testing $H : \theta = 2$ against $K : \theta = 1$. Then for $\alpha = \frac{3}{4}$, $\varphi_1$ and $\varphi_2$ are both
$\alpha$-admissible but $\varphi_2$ is not d-admissible.

(ii) Verify the existence of the test $\varphi_0$ of Example 6.7.12.
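
The size and power of the two tests in part (i) of Problem 6.33 are simple enough to compute exactly. The sketch below assumes the cutoffs $1$ and $3/2$ and the level $\alpha = 3/4$ (the fractions are partly illegible in the source, and these are the values for which the stated conclusion holds); the function and dictionary names are illustrative.

```python
from fractions import Fraction

def rejection_probability(cutoff, theta):
    """P_theta(X < cutoff) for X distributed as U(0, theta)."""
    cutoff, theta = Fraction(cutoff), Fraction(theta)
    return min(cutoff, theta) / theta

# Assumed cutoffs of the two tests: reject when X < cutoff.
cutoffs = {"phi1": Fraction(1), "phi2": Fraction(3, 2)}

# Size is evaluated under H: theta = 2, power under K: theta = 1.
size = {name: rejection_probability(c, 2) for name, c in cutoffs.items()}
power = {name: rejection_probability(c, 1) for name, c in cutoffs.items()}
```

Both tests have power 1 and size at most $3/4$, so neither can be improved upon among level-$3/4$ tests; but the first test has the same power with strictly smaller size ($1/2$ versus $3/4$), which is what rules out d-admissibility of the second.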

Problem 6.34 (i) The acceptance region $T_1/\sqrt{T_2} \le C$ of Example 6.7.13 is
a convex set in the $(T_1, T_2)$ plane.

(ii) In Example 6.7.13, the conditions of Theorem 6.7.1 are not satisfied for the
sets $A : T_1/\sqrt{T_2} \le C$ and $\Omega' : \xi > k$.

Problem 6.35 (i) In Example 6.7.13 (continued) show that there exist $C_0$,
$C_1$ such that $\lambda_0(\eta)$ and $\lambda_1(\eta)$ are probability densities (with respect to
Lebesgue measure).

(ii) Verify the densities $h_0$ and $h_1$.

Problem 6.36 Verify 

(i) the admissibility of the rejection region (6.24); 

(ii) the expression for I(z) given in the proof of Lemma 6.7.1. 

Problem 6.37 Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be independent $N(\xi, \sigma^2)$ and $N(\eta, \sigma^2)$
respectively. The one-sided t-test of $H : \delta = \xi/\sigma \le 0$ is admissible against the
alternatives (i) $0 < \delta < \delta_1$ for any $\delta_1 > 0$; (ii) $\delta > \delta_2$ for any $\delta_2 > 0$.

Problem 6.38 For the model of the preceding problem, generalize Example 
6.7.13 (continued) to show that the two-sided t-test is a Bayes solution for an 
appropriate prior distribution. 

Problem 6.39 Suppose $X = (X_1, \ldots, X_k)^T$ is multivariate normal with unknown
mean vector $(\theta_1, \ldots, \theta_k)^T$ and known nonsingular covariance matrix $\Sigma$.
Consider testing the null hypothesis $\theta_i = 0$ for all $i$ against $\theta_i \ne 0$ for some $i$. Let
$C$ be any closed convex subset of $k$-dimensional Euclidean space, and let $\phi$ be the
test that accepts the null hypothesis if $X$ falls in $C$. Show that $\phi$ is admissible.
[Hint: First assume $\Sigma$ is the identity and use Theorem 6.7.1. An alternative proof
is provided by Strasser (1985, Theorem 30.4).]
Section 6.9

Problem 6.40 Wilcoxon two-sample test. Let $U_{ij} = 1$ or $0$ as $X_i < Y_j$ or $X_i >
Y_j$, and let $U = \sum\sum U_{ij}$ be the number of pairs $X_i$, $Y_j$ with $X_i < Y_j$.

(i) Then $U = \sum S_i - \frac{1}{2} n(n + 1)$, where $S_1 < \cdots < S_n$ are the ranks of the $Y$'s
so that the test with rejection region $U > C$ is equivalent to the Wilcoxon
test.

(ii) Any given arrangement of $x$'s and $y$'s can be transformed into the arrangement
$x \ldots xy \ldots y$ through a number of interchanges of neighboring
elements. The smallest number of steps in which this can be done for the
observed arrangement is $mn - U$.
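
Both parts of Problem 6.40 can be checked on a small data set. In the sketch below the function names and the sample values are assumptions made for illustration; the count of neighbor interchanges uses the standard fact that the minimum number of adjacent swaps equals the number of out-of-order pairs, here the pairs in which a $y$ precedes an $x$.

```python
def wilcoxon_counts(xs, ys):
    """Return (U, sum of the Y ranks, pooled arrangement of labels)."""
    pooled = sorted([(v, "x") for v in xs] + [(v, "y") for v in ys])
    labels = [label for _, label in pooled]
    u = sum(1 for x in xs for y in ys if x < y)
    rank_sum = sum(r for r, label in enumerate(labels, start=1) if label == "y")
    return u, rank_sum, labels

def adjacent_swaps_to_sort(labels):
    """Minimum neighbor interchanges needed to reach x...xy...y."""
    swaps = 0
    ys_seen = 0
    for label in labels:
        if label == "y":
            ys_seen += 1
        else:
            # This x must move past every y that precedes it.
            swaps += ys_seen
    return swaps

xs = [1.2, 3.4, 5.0]        # m = 3, illustrative values without ties
ys = [0.7, 2.8, 4.1, 6.3]   # n = 4
u, rank_sum, labels = wilcoxon_counts(xs, ys)
```

For these data $U = 6$, the $Y$ ranks are 1, 3, 5, 7 with sum 16, and $16 - \frac{1}{2}\cdot 4 \cdot 5 = 6$, in agreement with (i); the arrangement needs $mn - U = 6$ interchanges, in agreement with (ii).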


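Problem 6.41 Expectation and variance of Wilcoxon statistic. If the $X$'s and
$Y$'s are samples from continuous distributions $F$ and $G$ respectively, the expectation
and variance of the Wilcoxon statistic $U$ defined in the preceding problem
are given by

$$E\left(\frac{U}{mn}\right) = P\{X < Y\} = \int F\, dG \qquad (6.59)$$

and

$$mn\, \mathrm{Var}\left(\frac{U}{mn}\right) = \int F\, dG + (n - 1) \int (1 - G)^2\, dF + (m - 1) \int F^2\, dG - (m + n - 1)\left(\int F\, dG\right)^2. \qquad (6.60)$$

Under the hypothesis $G = F$, these reduce to

$$E\left(\frac{U}{mn}\right) = \frac{1}{2}, \qquad \mathrm{Var}\left(\frac{U}{mn}\right) = \frac{m + n + 1}{12mn}. \qquad (6.61)$$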

Problem 6.42 (i) Let $Z_1, \ldots, Z_N$ be independently distributed with densities
$f_1, \ldots, f_N$, and let the rank of $Z_i$ be denoted by $T_i$. If $f$ is any
probability density which is positive whenever at least one of the $f_i$ is
positive, then

$$P\{T_1 = t_1, \ldots, T_N = t_N\} = \frac{1}{N!}\, E\left[\frac{f_1(V_{(t_1)})}{f(V_{(t_1)})} \cdots \frac{f_N(V_{(t_N)})}{f(V_{(t_N)})}\right], \qquad (6.62)$$

where $V_{(1)} < \cdots < V_{(N)}$ is an ordered sample from a distribution with
density $f$.


(ii) If $N = m + n$, $f_1 = \cdots = f_m = f$, $f_{m+1} = \cdots = f_{m+n} = g$, and
$S_1 < \cdots < S_n$ denote the ordered ranks of $Z_{m+1}, \ldots, Z_{m+n}$ among all the
$Z$'s, the probability distribution of $S_1, \ldots, S_n$ is given by (6.27).

[(i): The probability in question is $\int \cdots \int f_1(z_1) \ldots f_N(z_N)\, dz_1 \cdots dz_N$ integrated
over the set in which $z_i$ is the $t_i$th smallest of the $z$'s for $i = 1, \ldots, N$. Under the
transformation $w_{t_i} = z_i$ the integral becomes $\int \cdots \int f_1(w_{t_1}) \cdots f_N(w_{t_N})\, dw_1 \cdots dw_N$
integrated over the set $w_1 < \cdots < w_N$. The desired result now follows from the
fact that the probability density of the order statistics $V_{(1)} < \cdots < V_{(N)}$ is
$N!\, f(w_1) \cdots f(w_N)$ for $w_1 < \cdots < w_N$.]
Problem 6.43 (i) For any continuous cumulative distribution function $F$,
define $F^{-1}(0) = -\infty$, $F^{-1}(y) = \inf\{x : F(x) = y\}$ for $0 < y < 1$, $F^{-1}(1) =
\infty$ if $F(x) < 1$ for all finite $x$, and otherwise $\inf\{x : F(x) = 1\}$. Then
$F[F^{-1}(y)] = y$ for all $0 < y < 1$, but $F^{-1}[F(y)]$ may be $< y$.

(ii) Let $Z$ have a cumulative distribution function $G(z) = h[F(z)]$, where $F$
and $h$ are continuous cumulative distribution functions, the latter defined
over $(0, 1)$. If $Y = F(Z)$, then $P\{Y < y\} = h(y)$ for all $0 \le y \le 1$.

(iii) If $Z$ has the continuous cumulative distribution function $F$, then $F(Z)$ is
uniformly distributed over $(0, 1)$.

[(ii): $P\{F(Z) < y\} = P\{Z < F^{-1}(y)\} = F[F^{-1}(y)] = y$.]
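
Part (i) of Problem 6.43 is easiest to see on a distribution function with a flat stretch. The sketch below uses a hypothetical piecewise-linear $F$ that is constant (equal to $1/2$) on $[1, 2]$; this particular $F$, its closed-form inverse, and the function names are assumptions made only for the illustration.

```python
from fractions import Fraction

def cdf(x):
    """A continuous cdf on [0, 3] that is flat (equal to 1/2) on [1, 2]."""
    x = Fraction(x)
    if x <= 0:
        return Fraction(0)
    if x <= 1:
        return x / 2
    if x <= 2:
        return Fraction(1, 2)
    if x <= 3:
        return (x - 1) / 2
    return Fraction(1)

def inverse(y):
    """F^{-1}(y) = inf{x : F(x) = y} for 0 < y < 1, for the cdf above."""
    y = Fraction(y)
    if not 0 < y < 1:
        raise ValueError("y must lie strictly between 0 and 1")
    # On the flat stretch the infimum picks the left end point x = 1.
    return 2 * y if y <= Fraction(1, 2) else 2 * y + 1
```

Here `cdf(inverse(y))` returns `y` for every `y` in $(0, 1)$, while `inverse(cdf(Fraction(3, 2)))` returns 1, which is smaller than $3/2$: a point inside an interval of constancy is mapped back to the left end of that interval.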


Problem 6.44 Let $Z_i$ have a continuous cumulative distribution function $F_i$
$(i = 1, \ldots, N)$, and let $G$ be the group of all transformations $Z_i' = f(Z_i)$ such
that $f$ is continuous and strictly increasing.

(i) The transformation induced by $f$ in the space of distributions is $F_i' =
F_i(f^{-1})$.

(ii) Two $N$-tuples of distributions $(F_1, \ldots, F_N)$ and $(F_1', \ldots, F_N')$ belong to
the same orbit with respect to $\bar{G}$ if and only if there exist continuous
distribution functions $h_1, \ldots, h_N$ defined on $(0, 1)$ and strictly increasing
continuous distribution functions $F$ and $F'$ such that $F_i = h_i(F)$ and
$F_i' = h_i(F')$.

[(i): $P\{f(Z_i) \le y\} = P\{Z_i \le f^{-1}(y)\} = F_i[f^{-1}(y)]$.

(ii): If $F_i = h_i(F)$ and the $F_i'$ are on the same orbit, so that $F_i' = F_i(f^{-1})$, then
$F_i' = h_i(F')$ with $F' = F(f^{-1})$. Conversely, if $F_i = h_i(F)$, $F_i' = h_i(F')$, then
$F_i' = F_i(f^{-1})$ with $f = F'^{-1}(F)$.]


Problem 6.45 Under the assumptions of the preceding problem, if $F_i = h_i(F)$,
the distribution of the ranks $T_1, \ldots, T_N$ of $Z_1, \ldots, Z_N$ depends only on the $h_i$,
not on $F$. If the $h_i$ are differentiable, the distribution of the $T_i$ is given by

$$P\{T_1 = t_1, \ldots, T_N = t_N\} = \frac{E\left[h_1'(U_{(t_1)}) \cdots h_N'(U_{(t_N)})\right]}{N!}, \tag{6.63}$$

where $U_{(1)} < \cdots < U_{(N)}$ is an ordered sample of size $N$ from the uniform
distribution $U(0, 1)$. [The left-hand side of (6.63) is the probability that of the
quantities $F(Z_1), \ldots, F(Z_N)$, the $i$th one is the $t_i$th smallest for $i = 1, \ldots, N$.
This is given by $\int \cdots \int h_1'(y_1) \cdots h_N'(y_N)\, dy$ integrated over the region in
which $y_i$ is the $t_i$th smallest of the $y$'s for $i = 1, \ldots, N$. The proof is
completed as in Problem 6.42.]


Problem 6.46 Distribution of order statistics. 

(i) If $Z_1, \ldots, Z_N$ is a sample from a cumulative distribution function $F$ with
density $f$, the joint density of $Y_i = Z_{(s_i)}$, $i = 1, \ldots, n$, is

$$\frac{N!\, f(y_1) \cdots f(y_n)}{(s_1 - 1)!\,(s_2 - s_1 - 1)! \cdots (N - s_n)!}\,[F(y_1)]^{s_1 - 1}\,[F(y_2) - F(y_1)]^{s_2 - s_1 - 1} \cdots [1 - F(y_n)]^{N - s_n} \tag{6.64}$$

for $y_1 < \cdots < y_n$.

(ii) For the particular case that the $Z$'s are a sample from the uniform
distribution on $(0, 1)$, this reduces to

$$\frac{N!}{(s_1 - 1)!\,(s_2 - s_1 - 1)! \cdots (N - s_n)!}\; y_1^{s_1 - 1}\,(y_2 - y_1)^{s_2 - s_1 - 1} \cdots (1 - y_n)^{N - s_n}. \tag{6.65}$$

For $n = 1$, (6.65) is the density of the beta-distribution $B_{s, N-s+1}$, which
therefore is the distribution of the single order statistic $Z_{(s)}$ from $U(0, 1)$.

(iii) Let the distribution of $Y_1, \ldots, Y_n$ be given by (6.65), and let $V_i$ be defined
by $Y_i = V_i V_{i+1} \cdots V_n$ for $i = 1, \ldots, n$. Then the joint distribution of the $V_i$
is

$$\frac{N!}{(s_1 - 1)! \cdots (N - s_n)!} \prod_{i=1}^{n} v_i^{s_i - 1}(1 - v_i)^{s_{i+1} - s_i - 1} \qquad (s_{n+1} = N + 1),$$

so that the $V_i$ are independently distributed according to the beta-
distribution $B_{s_i,\, s_{i+1} - s_i}$.

[(i): If $Y_1 = Z_{(s_1)}, \ldots, Y_n = Z_{(s_n)}$ and $Y_{n+1}, \ldots, Y_N$ are the remaining $Z$'s in
the original order of their subscripts, the joint density of $Y_1, \ldots, Y_n$ is
$N(N - 1) \cdots (N - n + 1) \int \cdots \int f(y_{n+1}) \cdots f(y_N)\, dy_{n+1} \cdots dy_N$ integrated over
the region in which $s_1 - 1$ of the $y$'s are $< y_1$, $s_2 - s_1 - 1$ between $y_1$ and
$y_2$, $\ldots$, and $N - s_n$ are $> y_n$. Consider any set where a particular $s_1 - 1$ of the
$y$'s is $< y_1$, a particular $s_2 - s_1 - 1$ of them is between $y_1$ and $y_2$, and so on.
There are $(N - n)!/[(s_1 - 1)! \cdots (N - s_n)!]$ of these regions, and the integral has
the same value over each of them, namely
$[F(y_1)]^{s_1 - 1}[F(y_2) - F(y_1)]^{s_2 - s_1 - 1} \cdots [1 - F(y_n)]^{N - s_n}$.]


Problem 6.47 (i) If $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are samples with continuous
cumulative distribution functions $F$ and $G = h(F)$ respectively, and if $h$
is differentiable, the distribution of the ranks $S_1 < \cdots < S_n$ of the $Y$'s is
given by

$$P\{S_1 = s_1, \ldots, S_n = s_n\} = \frac{E\left[h'(U_{(s_1)}) \cdots h'(U_{(s_n)})\right]}{\binom{m+n}{m}}, \tag{6.66}$$

where $U_{(1)} < \cdots < U_{(m+n)}$ is an ordered sample from the uniform
distribution $U(0, 1)$.

(ii) If in particular $G = F^k$, where $k$ is a positive integer, (6.66) reduces to

$$P\{S_1 = s_1, \ldots, S_n = s_n\} = \frac{k^n}{\binom{m+n}{m}} \prod_{j=1}^{n} \frac{\Gamma(s_j + jk - j)}{\Gamma(s_j)} \cdot \frac{\Gamma(s_{j+1})}{\Gamma(s_{j+1} + jk - j)}. \tag{6.67}$$
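
As a numerical sanity check on (6.67), the following sketch evaluates the right-hand side with log-gamma functions and sums it over all possible rank sets. It assumes the convention $s_{n+1} = m + n + 1$, by analogy with $s_{n+1} = N + 1$ in Problem 6.46(iii); the function names are illustrative only and are not part of the text.

```python
from itertools import combinations
from math import comb, exp, lgamma, log


def lehmann_rank_prob(s, m, k):
    """Evaluate the right-hand side of (6.67) for ordered Y-ranks s = (s_1 < ... < s_n).

    Assumption: s_{n+1} is taken to be m + n + 1.
    """
    n = len(s)
    N = m + n
    ext = list(s) + [N + 1]              # append s_{n+1} = N + 1
    log_p = n * log(k) - log(comb(N, m))
    for j in range(1, n + 1):
        shift = j * k - j                # the offset jk - j in the gamma arguments
        log_p += lgamma(ext[j - 1] + shift) - lgamma(ext[j - 1])
        log_p += lgamma(ext[j]) - lgamma(ext[j] + shift)
    return exp(log_p)


def total_probability(m, n, k):
    """Sum (6.67) over all rank sets of size n; a valid distribution sums to 1."""
    return sum(lehmann_rank_prob(s, m, k)
               for s in combinations(range(1, m + n + 1), n))
```

For $m = n = 1$ and $k = 2$ the formula gives $P\{S_1 = 2\} = 2/3$, which is the chance that the larger of two independent draws from $F$ exceeds a third; for $k = 1$ every rank set has probability $1/\binom{m+n}{m}$.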


Problem 6.48 For sufficiently small $\theta > 0$, the Wilcoxon test at level

$$\alpha = k \Big/ \binom{N}{n}, \qquad k \text{ a positive integer},$$

maximizes the power (among rank tests) against the alternatives $(F, G)$ with
$G = (1 - \theta)F + \theta F^2$.

Problem 6.49 An alternative proof of the optimum property of the Wilcoxon
test for detecting a shift in the logistic distribution is obtained from the preceding
problem by equating $F(x - \theta)$ with $(1 - \theta)F(x) + \theta F^2(x)$, neglecting powers
of $\theta$ higher than the first. This leads to the differential equation
$F - \theta F' = (1 - \theta)F + \theta F^2$, the solution of which is the logistic distribution.
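
The step from the differential equation of Problem 6.49 to the logistic distribution can be spelled out as follows. This is only a sketch; the location constant $c$ is introduced here for illustration and does not appear in the problem.

```latex
\begin{align*}
F - \theta F' = (1 - \theta)F + \theta F^2
  &\iff F' = F(1 - F), \\
\frac{dF}{F(1 - F)} = dx
  &\iff \log\frac{F}{1 - F} = x - c, \\
F(x) &= \frac{1}{1 + e^{-(x - c)}} .
\end{align*}
```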

Problem 6.50 Let $\mathcal{F}_0$ be a family of probability measures over $(\mathcal{X}, \mathcal{A})$, and let
$C$ be a class of transformations of the space $\mathcal{X}$. Define a class $\mathcal{F}_1$ of distributions
by $F_1 \in \mathcal{F}_1$ if there exists $F_0 \in \mathcal{F}_0$ and $f \in C$ such that the distribution of $f(X)$
is $F_1$ when that of $X$ is $F_0$. If $\phi$ is any test satisfying (a) $E_{F_0}\phi(X) = \alpha$ for all
$F_0 \in \mathcal{F}_0$, and (b) $\phi(x) \le \phi[f(x)]$ for all $x$ and all $f \in C$, then $\phi$ is unbiased for
testing $\mathcal{F}_0$ against $\mathcal{F}_1$.

Problem 6.51 Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be samples from a common continuous
distribution $F$. Then the Wilcoxon statistic $U$ defined in Problem 6.40 is
distributed symmetrically about $\frac{1}{2}mn$ even when $m \ne n$.

Problem 6.52 (i) If $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are samples from $F(x)$ and
$G(y) = F(y - \Delta)$ respectively ($F$ continuous), and $D_{(1)} < \cdots < D_{(mn)}$
denote the ordered differences $Y_j - X_i$, then

$$P\left[D_{(k)} < \Delta < D_{(mn+1-k)}\right] = P_0[k \le U \le mn - k],$$

where $U$ is the statistic defined in Problem 6.40 and the probability on the
right side is calculated for $\Delta = 0$.

(ii) Determine the above confidence interval for A when m = n = 6, the 
confidence coefficient is and the observations are x : .113, .212, .249, 
.522, .709, .788, and y : .221, .433, .724, .913, .917, 1.58. 

(iii) For the data of (ii) determine the confidence intervals based on Student’s 
t for the case that F is normal. 

Hint: $D_{(i)} < \Delta < D_{(i+1)}$ if and only if $U_\Delta = mn - i$, where $U_\Delta$ is the statistic $U$
of Problem 6.40 calculated for the observations

$$X_1, \ldots, X_m;\; Y_1 - \Delta, \ldots, Y_n - \Delta.$$

[An alternative measure of the amount by which $G$ exceeds $F$ (without assuming
a location model) is $p = P\{X < Y\}$. The literature on confidence intervals for $p$
is reviewed in Mee (1990).]
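
The construction in Problem 6.52(i) can be sketched in a few lines. The helper names are hypothetical, and the code assumes that $U$ counts the pairs with $X_i < Y_j$, so that $U$ equals the sum of the $Y$-ranks minus $n(n+1)/2$. Because the confidence coefficient in part (ii) is not legible in this copy, the index $k$ is left as an input rather than derived from it.

```python
from fractions import Fraction
from itertools import combinations


def ordered_differences(x, y):
    """All mn differences y_j - x_i, sorted increasingly."""
    return sorted(yj - xi for yj in y for xi in x)


def null_coverage(m, n, k):
    """Exact P_0[k <= U <= mn - k], enumerating the equally likely Y-rank sets.

    Assumption: U = (sum of Y-ranks) - n(n + 1)/2, the number of pairs X_i < Y_j.
    """
    N = m + n
    hits = total = 0
    for ranks in combinations(range(1, N + 1), n):
        u = sum(ranks) - n * (n + 1) // 2
        total += 1
        hits += (k <= u <= m * n - k)
    return Fraction(hits, total)


def shift_interval(x, y, k):
    """The interval (D_(k), D_(mn+1-k)) of part (i), using 1-based order statistics."""
    d = ordered_differences(x, y)
    return d[k - 1], d[len(d) - k]
```

With the data of part (ii) and the purely illustrative choice $k = 6$, `shift_interval(x, y, 6)` gives roughly $(-0.089, 0.804)$ and `null_coverage(6, 6, 6)` gives $886/924 \approx 0.959$; other values of $k$ trade interval length against coverage in the same way.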

Problem 6.53 (i) Let $X, X'$ and $Y, Y'$ be independent samples of size 2
from continuous distributions $F$ and $G$ respectively. Then

$$p = P\{\max(X, X') < \min(Y, Y')\} + P\{\max(Y, Y') < \min(X, X')\} = \tfrac{1}{3} + 2\Delta,$$

where $\Delta = \int (F - G)^2\, d[(F + G)/2]$.

(ii) $\Delta = 0$ if and only if $F = G$.

[(i): $p = \int (1 - F)^2\, dG^2 + \int (1 - G)^2\, dF^2$, which after some computation reduces
to the stated form.

(ii): $\Delta = 0$ implies $F(x) = G(x)$ except on a set $N$ which has measure zero
both under $F$ and $G$. Suppose that $G(x_1) - F(x_1) = \eta > 0$. Then there exists
$x_0$ such that $G(x_0) = F(x_0) + \frac{1}{2}\eta$ and $F(x) < G(x)$ for $x_0 \le x \le x_1$. Since
$G(x_1) - G(x_0) > 0$, it follows that $\Delta > 0$.]

Problem 6.54 Continuation. 

(i) There exists at every significance level $\alpha$ a test of $H : G = F$ which has
power $> \alpha$ against all continuous alternatives $(F, G)$ with $F \ne G$.

(ii) There does not exist a nonrandomized unbiased rank test of $H$ against all
$G \ne F$ at level



[(i): let $X_i, X_i'$; $Y_i, Y_i'$ ($i = 1, \ldots, n$) be independently distributed, the $X$'s with
distribution $F$, the $Y$'s with distribution $G$, and let $V_i = 1$ if $\max(X_i, X_i') <
\min(Y_i, Y_i')$ or $\max(Y_i, Y_i') < \min(X_i, X_i')$, and $V_i = 0$ otherwise. Then $\sum V_i$ has
a binomial distribution with the probability $p$ defined in Problem 6.53, and the
problem reduces to that of testing $p = \frac{1}{3}$ against $p > \frac{1}{3}$.

(ii): Consider the particular alternatives for which $P\{X < Y\}$ is either 1 or 0.]

Problem 6.55 (i) Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be i.i.d. according to a continuous
distribution $F$, let the ranks of the $Y$'s be $S_1 < \cdots < S_n$, and let
$T = h(S_1) + \cdots + h(S_n)$. Then if either $m = n$ or $h(s) + h(N + 1 - s)$ is
independent of $s$, the distribution of $T$ is symmetric about $n \sum_{i=1}^{N} h(i)/N$.

(ii) Show that the two-sample Wilcoxon and normal-scores statistics are
symmetrically distributed under $H$, and determine their centers of
symmetry.

[(i): Let $S_i' = N + 1 - S_i$, and use the fact that $T' = \sum h(S_j')$ has the same
distribution under $H$ as $T$.]


Section 6.10 

Problem 6.56 (i) Let $m$ and $n$ be the numbers of negative and positive
observations among $Z_1, \ldots, Z_N$, and let $S_1 < \cdots < S_n$ denote the ranks of
the positive $Z$'s among $|Z_1|, \ldots, |Z_N|$. Consider the $N + \frac{1}{2}N(N - 1)$ distinct
sums $Z_i + Z_j$ with $i = j$ as well as $i \ne j$. The Wilcoxon signed rank statistic
$\sum S_j$ is equal to the number of these sums that are positive.

(ii) If the common distribution of the $Z$'s is $D$, then

$$E\left(\sum S_j\right) = \tfrac{1}{2}N(N + 1) - N D(0) - \tfrac{1}{2}N(N - 1) \int D(-z)\, dD(z).$$

[(i) Let $K$ be the required number of positive sums. Since $Z_i + Z_j$ is positive
if and only if the $Z$ corresponding to the larger of $|Z_i|$ and $|Z_j|$ is positive,
$K = \sum\sum U_{ij}$, where $U_{ij} = 1$ if $Z_j > 0$ and $|Z_i| \le Z_j$, and $U_{ij} = 0$
otherwise.]
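
The identity in part (i) is easy to check numerically. The sketch below uses hypothetical function names and assumes no zeros and no ties among the absolute values; it computes the signed rank statistic directly and as a count of positive pairwise sums.

```python
def signed_rank_statistic(z):
    """Sum of the ranks (among |z_1|, ..., |z_N|) of the positive observations."""
    order = sorted(range(len(z)), key=lambda i: abs(z[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if z[i] > 0)


def positive_pair_sums(z):
    """Number of sums z_i + z_j with i <= j (so i = j is included) that are positive."""
    n = len(z)
    return sum(z[i] + z[j] > 0 for i in range(n) for j in range(i, n))
```

For example, with `z = [1.5, -0.3, 2.2, -2.0, 0.7]` the positive observations have ranks 2, 3 and 5 among the absolute values, and both functions return 10.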

Problem 6.57 Let $Z_1, \ldots, Z_N$ be a sample from a distribution with density
$f(z - \theta)$, where $f(z)$ is positive for all $z$ and $f$ is symmetric about 0, and let $m$,
$n$, and the $S_j$ be defined as in the preceding problem.

(i) The distribution of $n$ and the $S_j$ is given by

$$P\{\text{the number of positive } Z\text{'s is } n \text{ and } S_1 = s_1, \ldots, S_n = s_n\} = \frac{1}{2^N}\, E\left[\frac{f(V_{(r_1)} + \theta) \cdots f(V_{(r_m)} + \theta)\, f(V_{(s_1)} - \theta) \cdots f(V_{(s_n)} - \theta)}{f(V_{(1)}) \cdots f(V_{(N)})}\right], \tag{6.68}$$

where $V_{(1)} < \cdots < V_{(N)}$ is an ordered sample from a distribution with
density $2f(v)$ for $v > 0$, and 0 otherwise.

(ii) The rank test of the hypothesis of symmetry with respect to the origin,
which maximizes the derivative of the power function at $\theta = 0$ and hence
maximizes the power for sufficiently small $\theta > 0$, rejects, under suitable
regularity conditions, when

$$-E\left[\sum_{j=1}^{n} \frac{f'(V_{(s_j)})}{f(V_{(s_j)})}\right] > C.$$

(iii) In the particular case that $f(z)$ is a normal density with zero mean, the
rejection region of (ii) reduces to $\sum E(V_{(s_j)}) > C$, where $V_{(1)} < \cdots < V_{(N)}$
is an ordered sample from a $\chi$-distribution with 1 degree of freedom.

(iv) Determine a density $f$ such that the one-sample Wilcoxon test is most
powerful against the alternatives $f(z - \theta)$ for sufficiently small positive $\theta$.

[(i): Apply Problem 6.42(i) to find an expression for $P\{S_1 = s_1, \ldots, S_n = s_n$
given that the number of positive $Z$'s is $n\}$.]


Problem 6.58 An alternative expression for (6.68) is obtained if the distribution
of $Z$ is characterized by $(\rho, F, G)$. If then $G = h(F)$ and $h$ is differentiable, the
distribution of $n$ and the $S_j$ is given by

$$\rho^m (1 - \rho)^n\, E\left[h'(U_{(s_1)}) \cdots h'(U_{(s_n)})\right], \tag{6.69}$$

where $U_{(1)} < \cdots < U_{(N)}$ is an ordered sample from $U(0, 1)$.


Problem 6.59 Unbiased tests of symmetry. Let $Z_1, \ldots, Z_N$ be a sample, and
let $\phi$ be any rank test of the hypothesis of symmetry with respect to the origin
such that $z_i \le z_i'$ for all $i$ implies $\phi(z_1, \ldots, z_N) \le \phi(z_1', \ldots, z_N')$. Then $\phi$ is
unbiased against the one-sided alternatives that the $Z$'s are stochastically larger
than some random variable that has a symmetric distribution with respect to the
origin.

Problem 6.60 The hypothesis of randomness.$^7$ Let $Z_1, \ldots, Z_N$ be independently
distributed with distributions $F_1, \ldots, F_N$, and let $T_i$ denote the rank of $Z_i$
among the $Z$'s. For testing the hypothesis of randomness $F_1 = \cdots = F_N$ against


7 Some tests of randomness are treated in Diaconis (1988). 
the alternatives $K$ of an upward trend, namely that $Z_i$ is stochastically increasing
with $i$, consider the rejection regions

$$\sum i\, t_i > C \tag{6.70}$$

and

$$\sum i\, E(V_{(t_i)}) > C, \tag{6.71}$$

where $V_{(1)} < \cdots < V_{(N)}$ is an ordered sample from a standard normal distribution
and where $t_i$ is the value taken on by $T_i$.

(i) The second of these tests is most powerful among rank tests against the
normal alternatives $F_i = N(\gamma + i\delta, \sigma^2)$ for sufficiently small $\delta$.

(ii) Determine alternatives against which the first test is a most powerful rank
test.

(iii) Both tests are unbiased against the alternatives of an upward trend; so is
any rank test $\phi$ satisfying $\phi(z_1, \ldots, z_N) \le \phi(z_1', \ldots, z_N')$ for any two points
for which $i < j$, $z_i < z_j$ implies $z_i' < z_j'$ for all $i$ and $j$.

[(iii): Apply Problem 6.50 with $C$ the class of transformations $z_1' = z_1$, $z_i' = f_i(z_i)$
for $i > 1$, where $z < f_2(z) < \cdots < f_N(z)$ and each $f_i$ is nondecreasing. If $\mathcal{F}_0$ is
the class of $N$-tuples $(F_1, \ldots, F_N)$ with $F_1 = \cdots = F_N$, then $\mathcal{F}_1$ coincides with
the class $K$ of alternatives.]


Problem 6.61 In the preceding problem let $U_{ij} = 1$ if $(j - i)(Z_j - Z_i) > 0$, and
$= 0$ otherwise.

(i) The test statistic $\sum i T_i$ can be expressed in terms of the $U$'s through the
relation

$$\sum_{i=1}^{N} i T_i = \sum_{i<j} (j - i)\, U_{ij} + \frac{N(N + 1)(N + 2)}{6}.$$

(ii) The smallest number of steps [in the sense of Problem 6.40(ii)] by which
$(Z_1, \ldots, Z_N)$ can be transformed into the ordered sample $(Z_{(1)}, \ldots, Z_{(N)})$
is $[N(N - 1)/2] - U$, where $U = \sum_{i<j} U_{ij}$. This suggests $U > C$ as another
rejection region for the preceding problem.

[(i): Let $V_{ij} = 1$ or 0 as $Z_i \le Z_j$ or $Z_i > Z_j$. Then $T_j = \sum_{i=1}^{N} V_{ij}$, and $V_{ij} = U_{ij}$ or
$1 - U_{ij}$ as $i < j$ or $i \ge j$. Expressing $\sum_{j=1}^{N} j T_j = \sum_{j=1}^{N} j \sum_{i=1}^{N} V_{ij}$ in terms of the
$U$'s and using the fact that $U_{ij} = U_{ji}$, the result follows by a simple calculation.]
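
Both parts of Problem 6.61 can be checked by brute force on small samples. The sketch below uses hypothetical helper names, assumes no ties, and assumes that a "step" means interchanging two adjacent elements, so that the minimal number of steps is the number of discordant pairs.

```python
def ranks(z):
    """T_i = rank of z_i among z_1, ..., z_N (no ties assumed)."""
    return [sum(zj <= zi for zj in z) for zi in z]


def both_sides(z):
    """Left and right sides of the relation in part (i) for the sequence z."""
    N = len(z)
    t = ranks(z)
    lhs = sum((i + 1) * t[i] for i in range(N))
    # For i < j, U_ij = 1 exactly when z_i < z_j; the weight j - i is index-shift invariant.
    weighted = sum((j - i) * (z[i] < z[j])
                   for i in range(N) for j in range(i + 1, N))
    return lhs, weighted + N * (N + 1) * (N + 2) // 6


def inversions(z):
    """Number of pairs i < j with z_i > z_j, which equals N(N - 1)/2 - U."""
    N = len(z)
    return sum(z[i] > z[j] for i in range(N) for j in range(i + 1, N))
```

Running `both_sides` over every permutation of a short sequence gives equal left and right sides each time, and `inversions(z)` agrees with $N(N-1)/2 - U$ when $U$ is computed as the number of concordant pairs.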


Problem 6.62 The hypothesis of independence. Let $(X_1, Y_1), \ldots, (X_N, Y_N)$ be a
sample from a bivariate distribution, and $(X_{(1)}, Z_1), \ldots, (X_{(N)}, Z_N)$ be the same
sample arranged according to increasing values of the $X$'s so that the $Z$'s are
a permutation of the $Y$'s. Let $R_i$ be the rank of $X_i$ among the $X$'s, $S_i$ the
rank of $Y_i$ among the $Y$'s, and $T_i$ the rank of $Z_i$ among the $Z$'s, and consider
the hypothesis of independence of $X$ and $Y$ against the alternatives of positive
regression dependence.

(i) Conditionally, given $(X_{(1)}, \ldots, X_{(N)})$, this problem is equivalent to testing
the hypothesis of randomness of the $Z$'s against the alternatives of an
upward trend.

(ii) The test (6.70) is equivalent to rejecting when the rank correlation
coefficient

$$\frac{\sum (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum (R_i - \bar{R})^2 \sum (S_i - \bar{S})^2}} = \frac{12}{N^3 - N} \sum \left(R_i - \frac{N + 1}{2}\right)\left(S_i - \frac{N + 1}{2}\right)$$

is too large.

(iii) An alternative expression for the rank correlation coefficient$^8$ is

$$1 - \frac{6}{N^3 - N} \sum (S_i - R_i)^2 = 1 - \frac{6}{N^3 - N} \sum (T_i - i)^2.$$

(iv) The test $U > C$ of Problem 6.61(ii) is equivalent to rejecting when
Kendall's $t$-statistic $\sum_{i<j} V_{ij}/N(N - 1)$ is too large, where $V_{ij}$ is $+1$ or
$-1$ as $(Y_j - Y_i)(X_j - X_i)$ is positive or negative.

(v) The tests (ii) and (iv) are unbiased against the alternatives of positive
regression dependence.
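
The agreement between the two expressions for the rank correlation coefficient in parts (ii) and (iii) can be confirmed with exact rational arithmetic. This is a minimal sketch with hypothetical function names; it assumes that there are no ties.

```python
from fractions import Fraction


def ranks(values):
    """Rank of each value within its own sample (no ties assumed)."""
    return [sum(v <= w for v in values) for w in values]


def spearman_product_form(x, y):
    """12/(N^3 - N) * sum (R_i - (N+1)/2)(S_i - (N+1)/2), as in part (ii)."""
    N = len(x)
    r, s = ranks(x), ranks(y)
    c = Fraction(N + 1, 2)
    return Fraction(12, N ** 3 - N) * sum((ri - c) * (si - c) for ri, si in zip(r, s))


def spearman_difference_form(x, y):
    """1 - 6/(N^3 - N) * sum (S_i - R_i)^2, as in part (iii)."""
    N = len(x)
    r, s = ranks(x), ranks(y)
    return 1 - Fraction(6, N ** 3 - N) * sum((si - ri) ** 2 for ri, si in zip(r, s))
```

The two functions return the same `Fraction` for any tie-free data, with the values $1$ and $-1$ attained for perfectly increasing and perfectly decreasing relationships.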


Section 6.11 

Problem 6.63 In Example 6.11.1, a family of sets $S(x, y)$ is a class of equivariant
confidence sets if and only if there exists a set $\mathcal{R}$ of real numbers such that

$$S(x, y) = \bigcup_{r \in \mathcal{R}} \{(\xi, \eta) : (x - \xi)^2 + (y - \eta)^2 = r^2\}.$$


Problem 6.64 Let $X_1, \ldots, X_n$; $Y_1, \ldots, Y_n$ be samples from $N(\xi, \sigma^2)$ and
$N(\eta, \tau^2)$ respectively. Then the confidence intervals (5.42) for $\tau^2/\sigma^2$, which can
be written as

$$\frac{\sum (Y_j - \bar{Y})^2}{k \sum (X_i - \bar{X})^2} \le \frac{\tau^2}{\sigma^2} \le \frac{k \sum (Y_j - \bar{Y})^2}{\sum (X_i - \bar{X})^2},$$

are uniformly most accurate equivariant with respect to the smallest group $G$
containing the transformations $X_i' = aX_i + b$, $Y_i' = aY_i + c$ for all $a \ne 0$, $b$, $c$ and
the transformation $X_i' = dY_i$, $Y_i' = X_i/d$ for all $d \ne 0$.

[Cf. Problem 6.11.] 

Problem 6.65 (i) One-sided equivariant confidence limits. Let $\theta$ be real-valued,
and suppose that, for each $\theta_0$, the problem of testing $\theta \le \theta_0$ against
$\theta > \theta_0$ (in the presence of nuisance parameters $\vartheta$) remains invariant under a
group $G_{\theta_0}$ and that $A(\theta_0)$ is a UMP invariant acceptance region for this
hypothesis at level $\alpha$. Let the associated confidence sets $S(x) = \{\theta : x \in A(\theta)\}$


8 For further material on these and other tests of independence, see Kendall (1970), 
Aiyar, Guillier, and Albers (1979), Kallenberg and Ledwina (1999). 
be one-sided intervals $S(x) = \{\theta : \underline{\theta}(x) \le \theta\}$, and suppose they are
equivariant under all $G_\theta$ and hence under the group $G$ generated by these. Then
the lower confidence limits $\underline{\theta}(X)$ are uniformly most accurate equivariant
at confidence level $1 - \alpha$ in the sense of minimizing $P_{\theta,\vartheta}\{\underline{\theta}(X) \le \theta'\}$ for all
$\theta' < \theta$.

(ii) Let $X_1, \ldots, X_n$ be independently distributed as $N(\xi, \sigma^2)$. The upper
confidence limits $\sigma^2 \le \sum (X_i - \bar{X})^2/C_0$ of Example 5.5.1 are uniformly most
accurate equivariant under the group $X_i' = X_i + c$, $-\infty < c < \infty$. They are
also equivariant (and hence uniformly most accurate equivariant) under
the larger group $X_i' = aX_i + c$, $-\infty < a, c < \infty$.


Problem 6.66 Counterexample. The following example shows that the equivariance
of $S(x)$ assumed in the paragraph following Lemma 6.11.1 does not follow
from the other assumptions of this lemma. In Example 6.5.1, let $n = 1$, let $G^{(1)}$
be the group $G$ of Example 6.5.1, and let $G^{(2)}$ be the corresponding group when
the roles of $Z$ and $Y = Y_1$ are reversed. For testing $H(\theta_0) : \theta = \theta_0$ against $\theta \ne \theta_0$
let $G_{\theta_0}$ be equal to $G^{(1)}$ augmented by the transformation $Y' = \theta_0 - (Y_1 - \theta_0)$
when $\theta < 0$, and let $G_{\theta_0}$ be equal to $G^{(2)}$ augmented by the transformation
$Z' = \theta_0 - (Z - \theta_0)$ when $\theta > 0$. Then there exists a UMP invariant test of $H(\theta_0)$
under $G_{\theta_0}$ for each $\theta_0$, but the associated confidence sets $S(x)$ are not equivariant
under $G = \{G_\theta, -\infty < \theta < \infty\}$.


Problem 6.67 (i) Let $X_1, \ldots, X_n$ be independently distributed as $N(\xi, \sigma^2)$,
and let $\theta = \xi/\sigma$. The lower confidence bounds $\underline{\theta}$ for $\theta$, which at confidence
level $1 - \alpha$ are uniformly most accurate invariant under the transformations
$X_i' = aX_i$, are

$$\underline{\theta} = C^{-1}\left(\frac{\sqrt{n}\,\bar{X}}{\sqrt{\sum (X_i - \bar{X})^2/(n - 1)}}\right),$$

where the function $C(\theta)$ is determined from a table of noncentral $t$ so that

$$P_\theta\left\{\frac{\sqrt{n}\,\bar{X}}{\sqrt{\sum (X_i - \bar{X})^2/(n - 1)}} \le C(\theta)\right\} = 1 - \alpha.$$

(ii) Determine $\underline{\theta}$ when the $x$'s are 7.6, 21.2, 15.1, 32.0, 19.7, 25.3, 29.1, 18.4
and the confidence level is $1 - \alpha = .95$.


Problem 6.68 (i) Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample from a bivariate
normal distribution, and let

$$\underline{\rho} = C^{-1}\left(\frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}\right),$$

where $C(\rho)$ is determined such that

$$P_\rho\left\{\frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \le C(\rho)\right\} = 1 - \alpha.$$

Then $\underline{\rho}$ is a lower confidence limit for the population correlation coefficient
$\rho$ at confidence level $1 - \alpha$; it is uniformly most accurate invariant with
respect to the group of transformations $X_i' = aX_i + b$, $Y_i' = cY_i + d$, with
$ac > 0$, $-\infty < b, d < \infty$.

(ii) Determine $\underline{\rho}$ at level $1 - \alpha = .95$ when the observations are (12.9, .56),
(9.8, .92), (13.1, .42), (12.5, 1.01), (8.7, .63), (10.7, .58), (9.3, .72), (11.4, .64).

Note. The following problems explore the relationship between pivotal 
quantities and equivariant confidence sets. For more details see Arnold (1984). 

Let $X$ be distributed according to $P_{\theta,\vartheta}$, and consider confidence sets for $\theta$ that
are equivariant under a group $G^*$, as in Section 6.11. If $\omega$ is the set of possible
$\theta$-values, define a group $\tilde{G}$ on $\mathcal{X} \times \omega$ by $\tilde{g}(\theta, x) = (gx, \bar{g}\theta)$.

Problem 6.69 Let $V(X, \theta)$ be any pivotal quantity [i.e. have a fixed probability
distribution independent of $(\theta, \vartheta)$], and let $B$ be any set in the range space of $V$
with probability $P(V \in B) = 1 - \alpha$. Then the sets $S(x)$ defined by

$$\theta \in S(x) \quad \text{if and only if} \quad V(\theta, x) \in B \tag{6.72}$$

are confidence sets for $\theta$ with confidence coefficient $1 - \alpha$.

Problem 6.70 (i) If $\tilde{G}$ is transitive over $\mathcal{X} \times \omega$ and $V(X, \theta)$ is maximal
invariant under $\tilde{G}$, then $V(X, \theta)$ is pivotal.

(ii) By (i), any quantity $W(X, \theta)$ which is invariant under $\tilde{G}$ is pivotal; give an
example showing that the converse need not be true.

Problem 6.71 Under the assumptions of the preceding problem, the confidence 
set S(x) is equivariant under G*. 

Problem 6.72 Under the assumptions of Problem 6.70, suppose that a family 
of confidence sets S(x) is equivariant under G*. Then there exists a set B in the 
range space of the pivotal V such that (6.72) holds. In this sense, all equivariant 
confidence sets can be obtained from pivotals. 

[Let $A$ be the subset of $\mathcal{X} \times \omega$ given by $A = \{(x, \theta) : \theta \in S(x)\}$. Show that
$\tilde{g}A = A$, so that any orbit of $\tilde{G}$ is either in $A$ or in the complement of $A$. Let the
maximal invariant $V(x, \theta)$ be represented as in Section 6.2 by a uniquely defined
point on each orbit, and let $B$ be the set of these points whose orbits are in $A$.
Then $V(x, \theta) \in B$ if and only if $(x, \theta) \in A$.]

Note. Problem 6.72 provides a simple check of the equivariance of confidence
sets. In Example 6.12.2, for instance, the confidence sets (6.43) are based on the
pivotal vector $(X_1 - \xi_1, \ldots, X_r - \xi_r)$, and hence are equivariant.


Section 6.12 

Problem 6.73 In Examples 6.12.1 and 6.12.2 there do not exist equivariant sets 
that uniformly minimize the probability of covering false values. 

Problem 6.74 In Example 6.12.1, the density $p(v)$ of $V = 1/S^2$ is unimodal.


Problem 6.75 Show that in Example 6.12.1, 
(i) the confidence sets $\sigma^2/S^2 \in A^{**}$ with $A^{**}$ given by (6.42) coincide with
the uniformly most accurate unbiased confidence sets for $\sigma^2$;

(ii) if $(a, b)$ is best with respect to (6.41) for $\sigma$, then $(a^r, b^r)$ is best for $\sigma^r$
($r > 0$).


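Problem 6.76 Let $X_1, \ldots, X_r$ be i.i.d. $N(0, 1)$, and let $S^2$ be independent of
the $X$'s and distributed as $\chi^2_\nu$. Then the distribution of
$(X_1/(S/\sqrt{\nu}), \ldots, X_r/(S/\sqrt{\nu}))$ is a central multivariate $t$-distribution, and its
density is

$$p(v_1, \ldots, v_r) = \frac{\Gamma\left(\frac{1}{2}(\nu + r)\right)}{(\pi\nu)^{r/2}\, \Gamma(\nu/2)} \left(1 + \frac{1}{\nu} \sum v_i^2\right)^{-\frac{1}{2}(\nu + r)}.$$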

Problem 6.77 The confidence sets (6.49) are uniformly most accurate equivariant
under the group $G$ defined at the end of Example 6.12.3.


Problem 6.78 In Example 6.12.4, show that 

(i) both sets (6.57) are intervals; 

(ii) the sets given by vp(v) > C coincide with the intervals (5.41). 

Problem 6.79 Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be independently normally distributed
as $N(\xi, \sigma^2)$ and $N(\eta, \sigma^2)$ respectively. Determine the equivariant
confidence sets for $\eta - \xi$ that have smallest Lebesgue measure when

(i) $\sigma$ is known;

(ii) $\sigma$ is unknown.

Problem 6.80 Generalize the confidence sets of Example 6.11.3 to the case that
the $X_i$ are $N(\xi_i, d_i\sigma^2)$ where the $d$'s are known constants.

Problem 6.81 Solve the problem corresponding to Example 6.12.1 when

(i) $X_1, \ldots, X_n$ is a sample from the exponential density $E(\xi, \sigma)$, and the
parameter being estimated is $\sigma$;

(ii) $X_1, \ldots, X_n$ is a sample from the uniform density $U(\xi, \xi + \tau)$, and the
parameter being estimated is $\tau$.

Problem 6.82 Let $X_1, \ldots, X_n$ be a sample from the exponential distribution
$E(\xi, \sigma)$. With respect to the transformations $X_i' = bX_i + a$ determine the smallest
equivariant confidence sets

(i) for $\sigma$, both when size is defined by Lebesgue measure and by the equivariant
measure (6.41);

(ii) for $\xi$.

Problem 6.83 Let $X_{ij}$ ($j = 1, \ldots, n_i$; $i = 1, \ldots, s$) be samples from the
exponential distribution $E(\xi_i, \sigma)$. Determine the smallest equivariant confidence sets
for $(\xi_1, \ldots, \xi_r)$ with respect to the group $X_{ij}' = bX_{ij} + a_i$.


Section 6.13 

Problem 6.84 If the confidence sets $S(x)$ are equivariant under the group $G$,
then the probability $P_\theta\{\theta \in S(X)\}$ of their covering the true value is invariant
under the induced group $\bar{G}$.

Problem 6.85 Consider the problem of obtaining a (two-sided) confidence band 
for an unknown continuous cumulative distribution function F. 

(i) Show that this problem is invariant both under strictly increasing and
strictly decreasing continuous transformations $X_i' = f(X_i)$, $i = 1, \ldots, n$,
and determine a maximal invariant with respect to this group.

(ii) Show that the problem is not invariant under the transformation

$$X_i' = \begin{cases} X_i & \text{if } |X_i| > 1, \\ X_i - 1 & \text{if } 0 < X_i < 1, \\ X_i + 1 & \text{if } -1 < X_i < 0. \end{cases}$$

[(ii): For this transformation $g$, the set $g^*S(x)$ is no longer a band.]


6.15 Notes 

Invariance considerations were introduced for particular classes of problems by 
Hotelling (1936) and Pitman (1939b). The general theory of invariant and almost 
invariant tests, together with its principal parametric applications, was developed 
by Hunt and Stein (1946) in an unpublished paper. In their paper, invariance 
was not proposed as a desirable property in itself but as a tool for deriving 
most stringent tests (cf. Chapter 8). Apart from this difference in point of view, 
the present account is based on the ideas of Hunt and Stein, about which E. 
L. Lehmann learned through conversations with Charles Stein during the years 
1947-1950. 

Of the admissibility results of Section 6.7, Theorem 6.7.1 is due to Birnbaum 
(1955) and Stein (1956a); Example 6.7.13 (continued) and Lemma 6.7.1, to Kiefer 
and Schwartz (1965). 

The problem of minimizing the volume or diameter of confidence sets is treated 
in DasGupta (1991). 

Deuchler (1914) appears to contain the first proposal of the two-sample procedure
known as the Wilcoxon test, which was later discovered independently by
many different authors. A history of this test is given by Kruskal (1957).
Hoeffding (1951) derives a basic rank distribution of which (6.20) is a special case,
and from it obtains locally optimum tests of the type (6.21).



7 

Linear Hypotheses 


7.1 A Canonical Form 

Many testing problems concern the means of normal distributions and are special
cases of the following general univariate linear hypothesis. Let $X_1, \ldots, X_n$ be
independently normally distributed with means $\xi_1, \ldots, \xi_n$ and common variance
$\sigma^2$. The vector of means$^1$ $\xi$ is known to lie in a given $s$-dimensional linear subspace
$\Pi_\Omega$ ($s < n$), and the hypothesis $H$ to be tested is that $\xi$ lies in a given
$(s - r)$-dimensional subspace $\Pi_\omega$ of $\Pi_\Omega$ ($r \le s$).

Example 7.1.1 In the two-sample problem of testing equality of two normal
means (considered with a different notation in Section 5.3), it is given that $\xi_i = \xi$
for $i = 1, \ldots, n_1$ and $\xi_i = \eta$ for $i = n_1 + 1, \ldots, n_1 + n_2$, and the hypothesis to be
tested is $\eta = \xi$. The space $\Pi_\Omega$ is then the space of vectors

$$(\xi, \ldots, \xi, \eta, \ldots, \eta) = \xi(1, \ldots, 1, 0, \ldots, 0) + \eta(0, \ldots, 0, 1, \ldots, 1)$$

spanned by $(1, \ldots, 1, 0, \ldots, 0)$ and $(0, \ldots, 0, 1, \ldots, 1)$, so that $s = 2$. Similarly,
$\Pi_\omega$ is the set of all vectors $(\xi, \ldots, \xi) = \xi(1, \ldots, 1)$ and hence $r = 1$.

Another hypothesis that can be tested in this situation is $\eta = \xi = 0$. The
space $\Pi_\omega$ is then the origin, $s - r = 0$ and hence $r = 2$. The more general
hypothesis $\xi = \xi_0$, $\eta = \eta_0$ is not a linear hypothesis, since $\Pi_\omega$ does not contain
the origin. However, it reduces to the previous case through the transformation
$X_i' = X_i - \xi_0$ ($i = 1, \ldots, n_1$), $X_i' = X_i - \eta_0$ ($i = n_1 + 1, \ldots, n_1 + n_2$).


$^1$ Throughout this chapter, a fixed coordinate system is assumed given in $n$-space. A
vector with components $\xi_1, \ldots, \xi_n$ is denoted by $\underline{\xi}$, and an $n \times 1$ column matrix with
elements $\xi_1, \ldots, \xi_n$ by $\xi$.

Example 7.1.2 The regression problem of Section 5.6 is essentially a linear
hypothesis. Changing the notation to make it conform with that of the present
section, let $\xi_i = \alpha + \beta t_i$, where $\alpha$, $\beta$ are unknown, and the $t_i$ known and not
all equal. Since $\Pi_\Omega$ is the space of all vectors $\alpha(1, \ldots, 1) + \beta(t_1, \ldots, t_n)$, it has
dimension $s = 2$. The hypothesis to be tested may be $\alpha = \beta = 0$ ($r = 2$) or it
may only specify that one of the parameters is zero ($r = 1$). The more general
hypotheses $\alpha = \alpha_0$, $\beta = \beta_0$ can be reduced to the previous case by letting
$X_i' = X_i - \alpha_0 - \beta_0 t_i$, since then $E(X_i') = \alpha' + \beta' t_i$ with $\alpha' = \alpha - \alpha_0$, $\beta' = \beta - \beta_0$.

Higher polynomial regression and regression in several variables also fall under
the linear-hypothesis scheme. Thus if $\xi_i = \alpha + \beta t_i + \gamma t_i^2$ or more generally
$\xi_i = \alpha + \beta t_i + \gamma u_i$, where the $t_i$ and $u_i$ are known, it can be tested whether one or
more of the regression coefficients $\alpha$, $\beta$, $\gamma$ are zero, and by transforming to the
variables $X_i' = X_i - \alpha_0 - \beta_0 t_i - \gamma_0 u_i$ also whether these coefficients have specified
values other than zero. ■


In the general case, the hypothesis can be given a simple form by making an
orthogonal transformation to variables $Y_1, \ldots, Y_n$

$$Y = CX, \qquad C = (c_{ij}), \quad i, j = 1, \ldots, n, \tag{7.1}$$

such that the first $s$ row vectors $c_1, \ldots, c_s$ of the matrix $C$ span $\Pi_\Omega$, with
$c_{r+1}, \ldots, c_s$ spanning $\Pi_\omega$. Then $Y_{s+1} = \cdots = Y_n = 0$ if and only if $X$ is in
$\Pi_\Omega$, and $Y_1 = \cdots = Y_r = Y_{s+1} = \cdots = Y_n = 0$ if and only if $X$ is in $\Pi_\omega$.
Let $\eta_i = E(Y_i)$, so that $\eta = C\xi$. Then since $\xi$ lies in $\Pi_\Omega$ a priori and in $\Pi_\omega$
under $H$, it follows that $\eta_i = 0$ for $i = s + 1, \ldots, n$ in both cases, and $\eta_i = 0$
for $i = 1, \ldots, r$ when $H$ is true. Finally, since the transformation is orthogonal,
the variables $Y_1, \ldots, Y_n$ are again independent and normally distributed with
common variance $\sigma^2$, and the problem reduces to the following canonical form.

The variables $Y_1, \ldots, Y_n$ are independently, normally distributed with common
variance $\sigma^2$ and means $E(Y_i) = \eta_i$ for $i = 1, \ldots, s$ and $E(Y_i) = 0$ for
$i = s + 1, \ldots, n$, so that their joint density is

$$\frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{s} (y_i - \eta_i)^2 + \sum_{i=s+1}^{n} y_i^2\right)\right]. \tag{7.2}$$

The $\eta$'s and $\sigma^2$ are unknown, and the hypothesis to be tested is

$$H : \eta_1 = \cdots = \eta_r = 0 \qquad (r \le s < n). \tag{7.3}$$


Example 7.1.3 To illustrate the determination of the transformation (7.1),
consider once more the regression model $\xi_i = \alpha + \beta t_i$ of Example 7.1.2. It was
seen there that $\Pi_\Omega$ is spanned by $(1, \ldots, 1)$ and $(t_1, \ldots, t_n)$. If the hypothesis
being tested is $\beta = 0$, $\Pi_\omega$ is the one-dimensional space spanned by the
first of these vectors. The row vector $c_2$ is in $\Pi_\omega$ and of length 1, and hence
$c_2 = (1/\sqrt{n}, \ldots, 1/\sqrt{n})$. Since $c_1$ is in $\Pi_\Omega$, of length 1, and orthogonal to $c_2$, its
coordinates are of the form $a + bt_i$, $i = 1, \ldots, n$, where $a$ and $b$ are determined by the
conditions $\sum (a + bt_i) = 0$ and $\sum (a + bt_i)^2 = 1$. The solutions of these equations
are $a = -b\bar{t}$, $b = 1/\sqrt{\sum (t_j - \bar{t})^2}$, and therefore $a + bt_i = (t_i - \bar{t})/\sqrt{\sum (t_j - \bar{t})^2}$,
and

$$Y_1 = \frac{\sum X_i (t_i - \bar{t})}{\sqrt{\sum (t_j - \bar{t})^2}} = \frac{\sum (X_i - \bar{X})(t_i - \bar{t})}{\sqrt{\sum (t_j - \bar{t})^2}}.$$

The remaining row vectors of $C$ can be taken to be any set of orthogonal unit
vectors that are orthogonal to $\Pi_\Omega$; it turns out not to be necessary to determine
them explicitly.

If the hypothesis to be tested is $\alpha = 0$, $\Pi_\omega$ is spanned by $(t_1, \ldots, t_n)$, so that
the $i$th coordinate of $c_2$ is $t_i/\sqrt{\sum t_j^2}$. The coordinates of $c_1$ are again of the form
$a + bt_i$ with $a$ and $b$ now determined by the equations $\sum (a + bt_i)t_i = 0$ and
$\sum (a + bt_i)^2 = 1$. The solutions are $b = -an\bar{t}/\sum t_j^2$, $a = \sqrt{\sum t_j^2 \big/ n\sum (t_j - \bar{t})^2}$,
and therefore

$$Y_1 = \sqrt{\frac{n \sum t_j^2}{\sum (t_j - \bar{t})^2}} \left(\bar{X} - \frac{\sum t_j X_j}{\sum t_j^2}\,\bar{t}\right).$$

In the case of the hypothesis $\alpha = \beta = 0$, $\Pi_\omega$ is the origin, and $c_1$, $c_2$ can be taken
as any two orthogonal unit vectors in $\Pi_\Omega$. One possible choice is that appropriate
to the hypothesis $\beta = 0$, in which case $Y_1$ is the linear function given there and
$Y_2 = \sqrt{n}\,\bar{X}$. ■
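
The first case of Example 7.1.3 (testing $\beta = 0$) can be illustrated with a short numerical sketch. The function names and the sample values in the usage note are hypothetical; the code only builds the two rows $c_1$, $c_2$ described above and does not construct the remaining rows of $C$.

```python
from math import sqrt


def regression_rows(t):
    """Rows c_1, c_2 of C for testing beta = 0 in the model xi_i = alpha + beta * t_i."""
    n = len(t)
    t_bar = sum(t) / n
    norm = sqrt(sum((tj - t_bar) ** 2 for tj in t))
    c1 = [(ti - t_bar) / norm for ti in t]   # a + b t_i with a = -b t_bar, b = 1/norm
    c2 = [1 / sqrt(n)] * n                   # unit vector spanning Pi_omega
    return c1, c2


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))
```

With, say, `t = [1.0, 2.0, 4.0, 7.0]` and observations `x = [0.5, 1.9, 3.2, 8.1]`, the rows returned by `regression_rows(t)` are orthonormal up to rounding, `dot(c1, x)` reproduces $Y_1 = \sum (X_i - \bar{X})(t_i - \bar{t})/\sqrt{\sum (t_j - \bar{t})^2}$, and `dot(c2, x)` reproduces $Y_2 = \sqrt{n}\,\bar{X}$.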


The general linear-hypothesis problem in terms of the $Y$'s remains invariant
under the group $G_1$ of transformations $Y_i' = Y_i + c_i$ for $i = r + 1, \ldots, s$; $Y_i' = Y_i$
for $i = 1, \ldots, r$; $s + 1, \ldots, n$. This leaves $Y_1, \ldots, Y_r$ and $Y_{s+1}, \ldots, Y_n$ as maximal
invariants. Another group of transformations leaving the problem invariant is the
group $G_2$ of all orthogonal transformations of $Y_1, \ldots, Y_r$. The middle set of
variables having been eliminated, it follows from Example 6.2.1(iii) that a maximal
invariant under $G_2$ is $U = \sum_{i=1}^{r} Y_i^2$, $Y_{s+1}, \ldots, Y_n$. This can be reduced to $U$ and
$V = \sum_{i=s+1}^{n} Y_i^2$ by sufficiency. Finally, the problem also remains invariant under
the group $G_3$ of scale changes $Y_i' = cY_i$, $c \ne 0$, for $i = 1, \ldots, n$. In the space
of $U$ and $V$ this induces the transformation $U^* = c^2 U$, $V^* = c^2 V$, under which
$W = U/V$ is maximal invariant. Thus the principle of invariance reduces the data
to the single statistic$^2$

$$W = \frac{\sum_{i=1}^{r} Y_i^2}{\sum_{i=s+1}^{n} Y_i^2}. \tag{7.4}$$

Each of the three transformation groups $G_i$ ($i = 1, 2, 3$) which lead to the above
reduction induces a corresponding group $\bar{G}_i$ in the parameter space. The group
$\bar{G}_1$ consists of the translations $\eta_i' = \eta_i + c_i$ ($i = r + 1, \ldots, s$), $\eta_i' = \eta_i$ ($i = 1, \ldots, r$),
$\sigma' = \sigma$, which leaves $(\eta_1, \ldots, \eta_r, \sigma)$ as maximal invariants. Since any orthogonal
transformation of $Y_1, \ldots, Y_r$ induces the same transformation on $\eta_1, \ldots, \eta_r$ and
leaves $\sigma^2$ unchanged, a maximal invariant under $\bar{G}_2$ is $\left(\sum_{i=1}^{r} \eta_i^2, \sigma^2\right)$. Finally the
elements of $\bar{G}_3$ are the transformations $\eta_i' = c\eta_i$, $\sigma' = |c|\sigma$, and hence a maximal
invariant with respect to the totality of these transformations is

$$\psi^2 = \frac{\sum_{i=1}^{r} \eta_i^2}{\sigma^2}. \tag{7.5}$$

$^2$ A corresponding reduction without assuming normality is discussed by Jagers
(1980).

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It follows from Theorem 6.3.2 that the distribution of W depends only on ip 2 , 
so that the principle of invariance reduces the problem to that of testing the 
simple hypothesis H : ip = 0. More precisely, the probability density of W is (cf. 
Problems 7.2 and 7.3) 


Pip(ui) = e 2 ^ 


fc =0 


av' 2 ) 


wi r ~ 1+k 

(1 _ l _ w ) 5 p+™- s )+fc ’ 


(7.6) 


where 

P [§(?• + n — s) + fcl 

°k = —T -s-• 

r (|r + fc)r[|(n-s)] 

For any ip\ the ratio p^ 1 (w)/p 0 (w) is an increasing function of w, and it follows 
from the Neyman-Pearson fundamental lemma that the most powerful invariant 
test for testing ip = 0 against ip = ipi rejects when W is too large, or equivalently 
when 


t Y?/r 

W* = -—- > C. (7.7) 

E Y?/in-s) 


The cutoff point C is determined so that the probability of rejection is a when 
ip = 0. Since in this case W* is the ratio of two independent \ 2 variables, each 
divided by the number of its degrees of freedom, the distribution of W* is the 
T-distribution with r and n — s degrees of freedom, and hence C is determined 
by 


/” 


F r ,n- S (y)dy = a. 


(7.8) 


The test is independent of tpi, and hence is UMP among all invariant tests. By 
Theorem 6.5.2, it is also UMP among all tests whose power function depends 
only on ip 2 . 

The rejection region (7.7) can also be expressed in the form
$$\frac{\sum_{i=1}^r Y_i^2}{\sum_{i=1}^r Y_i^2 + \sum_{i=s+1}^n Y_i^2} > C'. \tag{7.9}$$
When $\psi = 0$, the left-hand side is distributed according to the beta-distribution
with $r$ and $n - s$ degrees of freedom [defined through (5.24)], so that $C'$ is
determined by
$$\int_{C'}^1 B_{\frac12 r, \frac12 (n - s)}(y)\, dy = \alpha. \tag{7.10}$$
For an alternative value of $\psi$, the left-hand side of (7.9) is distributed according
to the noncentral beta-distribution with noncentrality parameter $\psi$, the density
of which is (Problem 7.3)
$$g_\psi(y) = e^{-\frac12 \psi^2} \sum_{k=0}^\infty \frac{(\frac12 \psi^2)^k}{k!}\, B_{\frac12 r + k, \frac12 (n - s)}(y). \tag{7.11}$$
The power of the test against an alternative $\psi$ is therefore$^3$
$$\beta(\psi) = \int_{C'}^1 g_\psi(y)\, dy.$$
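The power integral can be approximated numerically from the Poisson-mixture form of the noncentral beta density. The sketch below is illustrative only: it assumes $r = 2$ and $n - s = 4$, for which the central density is $2(1 - y)$ and the cutoff has the closed form $C' = 1 - \sqrt{\alpha}$, truncates the series after a fixed number of terms, and integrates by the midpoint rule. The function names are hypothetical.

```python
import math

def beta_density(y, p, q):
    """Density of the beta distribution with parameters p and q at 0 < y < 1."""
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return math.exp(log_norm + (p - 1.0) * math.log(y) + (q - 1.0) * math.log(1.0 - y))

def noncentral_beta_density(y, psi, r, n_minus_s, terms=60):
    """Truncated Poisson(psi^2 / 2) mixture of beta densities, as in (7.11)."""
    lam = 0.5 * psi * psi
    if lam == 0.0:
        return beta_density(y, 0.5 * r, 0.5 * n_minus_s)
    total = 0.0
    for k in range(terms):
        log_weight = -lam + k * math.log(lam) - math.lgamma(k + 1.0)
        total += math.exp(log_weight) * beta_density(y, 0.5 * r + k, 0.5 * n_minus_s)
    return total

def power(psi, c_prime, r, n_minus_s, grid=1000):
    """Midpoint-rule approximation to the integral of g_psi over (C', 1)."""
    h = (1.0 - c_prime) / grid
    return h * sum(
        noncentral_beta_density(c_prime + (i + 0.5) * h, psi, r, n_minus_s)
        for i in range(grid)
    )

alpha = 0.05
r, n_minus_s = 2, 4
# For r = 2, n - s = 4 the central tail probability is (1 - C')^2, so C' = 1 - sqrt(alpha).
c_prime = 1.0 - math.sqrt(alpha)
powers = [power(psi, c_prime, r, n_minus_s) for psi in (0.0, 1.0, 2.0)]
```

The first entry of `powers` reproduces the level $\alpha$, and the remaining entries increase with $\psi$. For other degrees of freedom the cutoff $C'$ would have to be obtained from tables or a numerical quantile routine.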

In the particular case $r = 1$ the rejection region (7.7) reduces to
$$\frac{|Y_1|}{\sqrt{\sum_{i=s+1}^n Y_i^2 / (n - s)}} > C_0. \tag{7.12}$$
This is a two-sided $t$-test which by the theory of Chapter 5 (see for example
Problem 5.5) is UMP unbiased. On the other hand, no UMP unbiased test exists
for $r > 1$.

The $F$-test (7.7) shares the admissibility properties of the two-sided $t$-test
discussed in Section 6.7. In particular, the test is admissible against distant
alternatives $\psi^2 > \psi_1^2$ (Problem 7.6) and against nearby alternatives $\psi^2 < \psi_2^2$
(Problem 7.7). It was shown by Lehmann and Stein (1953) that the test is in
fact admissible against the alternatives $\psi^2 < \psi_1^2$ for any $\psi_1$ and hence against all
invariant alternatives.

7.2 Linear Hypotheses and Least Squares 

In applications to specific problems it is usually not convenient to carry out the
reduction to canonical form explicitly. The test statistic $W$ can be expressed in
terms of the original variables by noting that $\sum_{i=s+1}^n Y_i^2$ is the minimum value
of
$$\sum_{i=1}^s (Y_i - \eta_i)^2 + \sum_{i=s+1}^n Y_i^2 = \sum_{i=1}^n [Y_i - E(Y_i)]^2$$
under unrestricted variation of the $\eta$'s. Also, since the transformation $Y = CX$
is orthogonal and orthogonal transformations leave distances unchanged,
$$\sum_{i=1}^n [Y_i - E(Y_i)]^2 = \sum_{i=1}^n (X_i - \xi_i)^2.$$
Furthermore, there is a 1 : 1 correspondence between the totality of $s$-tuples
$(\eta_1, \ldots, \eta_s)$ and the totality of vectors $\xi$ in $\Pi_\Omega$. Hence
$$\sum_{i=s+1}^n Y_i^2 = \sum_{i=1}^n (X_i - \hat\xi_i)^2, \tag{7.13}$$
where the $\hat\xi$'s are the least-squares estimates of the $\xi$'s under $\Omega$, that is, the values
that minimize $\sum_{i=1}^n (X_i - \xi_i)^2$ subject to $\xi$ in $\Pi_\Omega$.


$^3$ Tables of the power of the $F$-test are provided by Tiku (1967, 1972) [reprinted in
Graybill (1976)] and Cohen (1977); charts are given in Pearson and Hartley (1972).
Various approximations are discussed by Johnson, Kotz and Balakrishnan (1995).

In the same way it is seen that
$$\sum_{i=1}^r Y_i^2 + \sum_{i=s+1}^n Y_i^2 = \sum_{i=1}^n (X_i - \hat{\hat\xi}_i)^2,$$
where the $\hat{\hat\xi}$'s are the values that minimize $\sum_{i=1}^n (X_i - \xi_i)^2$ subject to $\xi$ in $\Pi_\omega$. The
test (7.7) therefore becomes
$$W^* = \frac{\left[ \sum_{i=1}^n (X_i - \hat{\hat\xi}_i)^2 - \sum_{i=1}^n (X_i - \hat\xi_i)^2 \right] / r}{\sum_{i=1}^n (X_i - \hat\xi_i)^2 / (n - s)} > C, \tag{7.14}$$
where $C$ is determined by (7.8). Geometrically the vectors $\hat\xi$ and $\hat{\hat\xi}$ are the
projections of $X$ on $\Pi_\Omega$ and $\Pi_\omega$, so that the triangle formed by $X$, $\hat\xi$, and $\hat{\hat\xi}$ has a
right angle at $\hat\xi$ (see Figure 7.1).

Thus the denominator and numerator of $W^*$, except for the factors $1/(n - s)$
and $1/r$, are the squares of the distances between $X$ and $\hat\xi$ and between $\hat\xi$ and $\hat{\hat\xi}$
respectively. An alternative expression for $W^*$ is therefore
$$W^* = \frac{\sum_{i=1}^n (\hat\xi_i - \hat{\hat\xi}_i)^2 / r}{\sum_{i=1}^n (X_i - \hat\xi_i)^2 / (n - s)}. \tag{7.15}$$

It is desirable to express also the noncentrality parameter $\psi^2 = \sum_{i=1}^r \eta_i^2 / \sigma^2$ in
terms of the $\xi$'s. Now $X = C^{-1} Y$, $\xi = C^{-1} \eta$, and
$$\sum_{i=1}^r Y_i^2 = \sum_{i=1}^n (X_i - \hat{\hat\xi}_i)^2 - \sum_{i=1}^n (X_i - \hat\xi_i)^2. \tag{7.16}$$
If the right-hand side of (7.16) is denoted by $f(X)$, it follows that $\sum_{i=1}^r \eta_i^2 = f(\xi)$.

A slight generalization of a linear hypothesis is the inhomogeneous hypothesis
which specifies for the vector of means $\xi$ a subhyperplane $\Pi'_\omega$ of $\Pi_\Omega$ not passing
through the origin. Let $\Pi_\omega$ denote the subspace of $\Pi_\Omega$ which passes through the
origin and is parallel to $\Pi'_\omega$. If $\xi^0$ is any point of $\Pi'_\omega$, the set $\Pi'_\omega$ consists of the
totality of points $\xi = \xi^* + \xi^0$ as $\xi^*$ ranges over $\Pi_\omega$. Applying the transformation
(7.1) with respect to $\Pi_\omega$, the vector of means $\eta$ for $\xi \in \Pi'_\omega$ is then given by
$\eta = C\xi = C\xi^* + C\xi^0$ in the canonical form (7.2), and the totality of these
vectors is therefore characterized by the equations $\eta_1 = \eta_1^0, \ldots, \eta_r = \eta_r^0$,
$\eta_{s+1} = \cdots = \eta_n = 0$, where $\eta_i^0$ is the $i$th coordinate of $C\xi^0$. In the canonical form,
the inhomogeneous hypothesis $\xi \in \Pi'_\omega$ therefore becomes $\eta_i = \eta_i^0$ ($i = 1, \ldots, r$).
This reduces to the homogeneous case on replacing $Y_i$ with $Y_i - \eta_i^0$, and it follows
from (7.7) that the UMP invariant test has the rejection region
$$\frac{\sum_{i=1}^r (Y_i - \eta_i^0)^2 / r}{\sum_{i=s+1}^n Y_i^2 / (n - s)} > C, \tag{7.17}$$
and that the noncentrality parameter is $\psi^2 = \sum_{i=1}^r (\eta_i - \eta_i^0)^2 / \sigma^2$.

In applications it is usually most convenient to apply the transformation $X_i - \xi_i^0$
directly to (7.14) or (7.15). It follows from (7.17) that such a transformation
always leaves the denominator unchanged. This can also be seen geometrically,
since the transformation is a translation of $n$-space parallel to $\Pi_\Omega$ and therefore
leaves the distance $\sum (X_i - \hat\xi_i)^2$ from $X$ to $\Pi_\Omega$ unchanged. The noncentrality
parameter can be computed as before by replacing $X$ with $\xi$ in the transformed
numerator (7.16).

Some examples of linear hypotheses, all with r = 1, were already discussed in 
Chapter 5. The following treats two of these from the present point of view. 


Example 7.2.1 Let $X_1, \ldots, X_n$ be independently, normally distributed with
common mean $\mu$ and variance $\sigma^2$, and consider the hypothesis $H : \mu = 0$. Here
$\Pi_\Omega$ is the line $\xi_1 = \cdots = \xi_n$, $\Pi_\omega$ is the origin, and $s = r = 1$. Let $\bar X = n^{-1} \sum X_i$.
From the identity
$$\sum (X_i - \mu)^2 = \sum (X_i - \bar X)^2 + n(\bar X - \mu)^2,$$
it is seen that $\hat\xi_i = \bar X$, while $\hat{\hat\xi}_i = 0$. The test statistic and $\psi^2$ are therefore given
by
$$W = \frac{n \bar X^2}{\sum (X_i - \bar X)^2} \quad \text{and} \quad \psi^2 = \frac{n \mu^2}{\sigma^2}.$$
Under the hypothesis, the distribution of $(n - 1)W$ is that of the square of a
variable having Student's $t$-distribution with $n - 1$ degrees of freedom. ■
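The relation between $W$ and the one-sample $t$-statistic can be confirmed on a small data set. The following Python sketch uses hypothetical observations and function names; it computes $W$ as defined in the example and the usual statistic $t = \sqrt{n}\,\bar X / S$ with $S^2 = \sum (X_i - \bar X)^2/(n - 1)$.

```python
import math

def one_sample_w(x):
    """W = n * xbar^2 / sum (x_i - xbar)^2 for the hypothesis mu = 0."""
    n = len(x)
    x_bar = sum(x) / n
    ss = sum((xi - x_bar) ** 2 for xi in x)
    return n * x_bar ** 2 / ss

def one_sample_t(x):
    """Student's t statistic sqrt(n) * xbar / S for the hypothesis mu = 0."""
    n = len(x)
    x_bar = sum(x) / n
    s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    return math.sqrt(n) * x_bar / math.sqrt(s2)

x = [0.8, -0.3, 1.4, 0.6, 1.1, -0.2]   # hypothetical sample
w = one_sample_w(x)
t_stat = one_sample_t(x)
```

Up to rounding, `(len(x) - 1) * w` equals `t_stat ** 2`.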


Example 7.2.2 In the two-sample problem considered in Example 7.1.1 with
$n = n_1 + n_2$, the sum of squares
$$\sum_{i=1}^{n_1} (X_i - \xi)^2 + \sum_{i=n_1+1}^{n} (X_i - \eta)^2$$
is minimized by
$$\hat\xi = X^{(1)}_\cdot = \sum_{i=1}^{n_1} \frac{X_i}{n_1}, \qquad \hat\eta = X^{(2)}_\cdot = \sum_{i=n_1+1}^{n} \frac{X_i}{n_2},$$
while, under the hypothesis $\eta - \xi = 0$,
$$\hat{\hat\xi} = \hat{\hat\eta} = \bar X = \frac{n_1 X^{(1)}_\cdot + n_2 X^{(2)}_\cdot}{n_1 + n_2}.$$
The numerator of the test statistic (7.15) is therefore
$$n_1 (X^{(1)}_\cdot - \bar X)^2 + n_2 (X^{(2)}_\cdot - \bar X)^2 = \frac{n_1 n_2}{n_1 + n_2} \left[ X^{(2)}_\cdot - X^{(1)}_\cdot \right]^2.$$
The more general hypothesis $\eta - \xi = \theta_0$ reduces to the previous case on replacing
$X_i$ with $X_i - \theta_0$ for $i = n_1 + 1, \ldots, n$, and is therefore rejected when
$$\frac{(X^{(2)}_\cdot - X^{(1)}_\cdot - \theta_0)^2 \big/ \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}{\left[ \sum_{i=1}^{n_1} (X_i - X^{(1)}_\cdot)^2 + \sum_{i=n_1+1}^{n} (X_i - X^{(2)}_\cdot)^2 \right] \big/ (n_1 + n_2 - 2)} > C.$$
The noncentrality parameter is $\psi^2 = (\eta - \xi - \theta_0)^2 / [(1/n_1 + 1/n_2)\sigma^2]$. Under
the hypothesis, the square root of the test statistic has the $t$-distribution with
$n_1 + n_2 - 2$ degrees of freedom. ■
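A short numerical check of the two-sample formulas may be helpful. The sketch below is an illustration with hypothetical data and function names; it evaluates both sides of the numerator identity and the test statistic for a given $\theta_0$.

```python
def two_sample_statistics(x1, x2, theta0=0.0):
    """Return both sides of the numerator identity and the two-sample F-type statistic."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    pooled_mean = (n1 * m1 + n2 * m2) / (n1 + n2)
    numerator_lhs = n1 * (m1 - pooled_mean) ** 2 + n2 * (m2 - pooled_mean) ** 2
    numerator_rhs = n1 * n2 / (n1 + n2) * (m2 - m1) ** 2
    within = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    statistic = ((m2 - m1 - theta0) ** 2 / (1 / n1 + 1 / n2)) / (within / (n1 + n2 - 2))
    return numerator_lhs, numerator_rhs, statistic

x1 = [4.2, 5.1, 3.8, 4.9]        # hypothetical first sample
x2 = [6.0, 5.4, 6.7, 5.9, 6.3]   # hypothetical second sample
lhs, rhs, f_stat = two_sample_statistics(x1, x2)
```

The values `lhs` and `rhs` agree up to rounding. For the toy samples `[1, 3]` and `[5, 7]` the statistic works out to 8.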


Explicit formulae for the $\hat\xi_i$ and $\hat{\hat\xi}_i$ can be obtained by introducing a coordinate
system into the parameter space. Suppose that, in such a system, $\Pi_\Omega$ is defined
by the equations
$$\xi_i = \sum_{j=1}^s a_{ij} \beta_j, \qquad i = 1, \ldots, n,$$
or, in matrix notation,
$$\underset{n \times 1}{\xi} = \underset{n \times s}{A}\ \underset{s \times 1}{\beta}, \tag{7.18}$$
where $A$ is known and of rank $s$, and $\beta_1, \ldots, \beta_s$ are unknown parameters. If
$\hat\beta_1, \ldots, \hat\beta_s$ are the least-squares estimators minimizing $\sum_i (X_i - \sum_j a_{ij} \beta_j)^2$, it is
seen by differentiation that the $\hat\beta_j$ are the solutions of the equations
$$A^T A \beta = A^T X$$
and hence are given by
$$\hat\beta = (A^T A)^{-1} A^T X.$$
(That $A^T A$ is nonsingular follows by Problem 6.3.) Thus, we obtain
$$\hat\xi = A (A^T A)^{-1} A^T X.$$
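The normal equations give a direct way to compute $\hat\xi$ and $\hat{\hat\xi}$, and hence $W^*$ in either of the forms (7.14) and (7.15). The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the design matrices are small lists of rows, $\Pi_\omega$ is assumed to be the column space of a second matrix `a_null` contained in that of `a_full`, and the linear system is solved by plain Gaussian elimination rather than by a numerically preferable method such as a QR decomposition.

```python
def solve_linear(m, rhs):
    """Solve m z = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [list(row) + [b] for row, b in zip(m, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, k):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, k + 1):
                aug[i][j] -= factor * aug[col][j]
    z = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, k))
        z[i] = (aug[i][k] - tail) / aug[i][i]
    return z

def project(a, x):
    """Return A (A^T A)^{-1} A^T x, the least-squares fit of x on the columns of a."""
    cols = list(zip(*a))
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * v for u, v in zip(ci, x)) for ci in cols]
    beta_hat = solve_linear(gram, rhs)          # solves the normal equations
    return [sum(aij * bj for aij, bj in zip(row, beta_hat)) for row in a]

def sq_dist(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def w_star_both_forms(a_full, a_null, x):
    """W* computed from (7.14) and from (7.15)."""
    n, s = len(x), len(a_full[0])
    r = s - len(a_null[0])
    xi_hat = project(a_full, x)       # fit under Omega
    xi_hat_hat = project(a_null, x)   # fit under omega
    denom = sq_dist(x, xi_hat) / (n - s)
    form_714 = (sq_dist(x, xi_hat_hat) - sq_dist(x, xi_hat)) / r / denom
    form_715 = sq_dist(xi_hat, xi_hat_hat) / r / denom
    return form_714, form_715

t = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical design points
x = [1.1, 1.9, 3.2, 3.8, 5.3]          # hypothetical observations
a_full = [[1.0, ti] for ti in t]       # columns (1, ..., 1) and (t_1, ..., t_n)
a_null = [[1.0] for _ in t]            # hypothesis: slope equal to zero
form_714, form_715 = w_star_both_forms(a_full, a_null, x)
xi_hat = project(a_full, x)
```

The two forms agree up to rounding, reflecting the right angle at $\hat\xi$, and projecting `xi_hat` again returns `xi_hat`.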

Since $\hat\xi = \hat\xi(X)$ is the projection of $X$ into the space $\Pi_\Omega$ spanned by the $s$
columns of $A$, the formula $\hat\xi = A(A^T A)^{-1} A^T X$ shows that $P = A(A^T A)^{-1} A^T$
has the property claimed for it in Example 6.2.3, that for any $X$ in $R^n$, $PX$ is
the projection of $X$ into $\Pi_\Omega$.


7.3 Testing Homogeneity

The UMP invariant test obtained in the preceding section for testing the equality
of the means of two normal distributions with common variance is also UMP
unbiased (Section 5.3). However, when a number of populations greater than 2 is to
be tested for homogeneity of means, a UMP unbiased test no longer exists, so that
invariance considerations lead to a new result. Let $X_{ij}$ ($j = 1, \ldots, n_i$; $i = 1, \ldots, s$)
be independently distributed as $N(\mu_i, \sigma^2)$, and consider the hypothesis
$$H : \mu_1 = \cdots = \mu_s.$$
This arises, for example, in the comparison of a number of different treatments,
processes, varieties, or locations, when one wishes to test whether these differences
have any effect on the outcome $X$. It may arise more generally in any situation
involving a one-way classification of the outcomes, that is, in which the outcomes
are classified according to a single factor. In such situations, when rejecting $H$ one
will frequently want to know more about the $\mu_i$'s than just that they are unequal.
The resulting multiple comparison problem will be discussed in Section 9.3.

The hypothesis $H$ is a linear hypothesis with $r = s - 1$, with $\Pi_\Omega$ given by
the equations $\xi_{ij} = \xi_{ik}$ for $j, k = 1, \ldots, n_i$, $i = 1, \ldots, s$ and with $\Pi_\omega$ the line on
which all $n = \sum n_i$ coordinates $\xi_{ij}$ are equal. We have
$$\sum\sum (X_{ij} - \mu_i)^2 = \sum\sum (X_{ij} - X_{i\cdot})^2 + \sum n_i (X_{i\cdot} - \mu_i)^2$$
with $X_{i\cdot} = \sum_{j=1}^{n_i} X_{ij}/n_i$, and hence $\hat\xi_{ij} = X_{i\cdot}$. Also,
$$\sum\sum (X_{ij} - \mu)^2 = \sum\sum (X_{ij} - X_{\cdot\cdot})^2 + n (X_{\cdot\cdot} - \mu)^2$$
with $X_{\cdot\cdot} = \sum\sum X_{ij}/n$, so that $\hat{\hat\xi}_{ij} = X_{\cdot\cdot}$. Using the form (7.15) of $W^*$, the test
therefore becomes
$$W^* = \frac{\sum n_i (X_{i\cdot} - X_{\cdot\cdot})^2 / (s - 1)}{\sum\sum (X_{ij} - X_{i\cdot})^2 / (n - s)} > C. \tag{7.19}$$
The noncentrality parameter is
$$\psi^2 = \frac{\sum n_i (\mu_i - \mu_\cdot)^2}{\sigma^2} \quad \text{with} \quad \mu_\cdot = \frac{\sum n_i \mu_i}{n}.$$

The sum of squares in both numerator and denominator of (7.19) admits
three interpretations, which are closely related: (i) as the two components in
the decomposition of the total variation
$$\sum\sum (X_{ij} - X_{\cdot\cdot})^2 = \sum\sum (X_{ij} - X_{i\cdot})^2 + \sum n_i (X_{i\cdot} - X_{\cdot\cdot})^2,$$
of which the first represents the variation within, and the second the variation
between populations; (ii) as a basis, through the test (7.19), for comparing these
two sources of variation; (iii) as estimates of their expected values, $(n - s)\sigma^2$ and
$(s - 1)\sigma^2 + \sum n_i (\mu_i - \mu_\cdot)^2$ (Problem 7.11). This breakdown of the total variation,
together with the various interpretations of the components, is an example of
an analysis of variance,$^4$ which will be applied to more complex problems in the
succeeding sections.
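The decomposition of the total variation and the statistic (7.19) are easy to compute directly. The following Python sketch is an illustration with hypothetical data and names; the cutoff $C$ is not computed here and would come from the $F$-distribution with $s - 1$ and $n - s$ degrees of freedom.

```python
def one_way_anova(groups):
    """Between, within and total sums of squares and W* of (7.19) for a list of samples."""
    s = len(groups)
    n = sum(len(g) for g in groups)
    group_means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    total = sum((x - grand_mean) ** 2 for g in groups for x in g)
    w_star = (between / (s - 1)) / (within / (n - s))
    return {"between": between, "within": within, "total": total, "w_star": w_star}

groups = [[4.1, 5.0, 4.6], [6.2, 5.8, 6.9, 6.4], [5.1, 4.9]]   # hypothetical samples
result = one_way_anova(groups)
```

In `result`, the entry `total` equals `within + between` up to rounding. For the toy input `[[1, 3], [5, 7]]` the function gives a between sum of squares of 16, a within sum of squares of 4, and $W^* = 8$.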

When applying the principle of invariance, it is important to make sure that the
underlying symmetry assumptions really are satisfied. In the problem of testing
the equality of a number of normal means $\mu_1, \ldots, \mu_s$, for example, all parameter
points, which have the same value of $\psi^2 = \sum n_i (\mu_i - \mu_\cdot)^2 / \sigma^2$, are identified under
the principle of invariance. This is appropriate only when these alternatives can
be considered as being equidistant from the hypothesis. In particular, it should
then be immaterial whether the given value of $\psi^2$ is built up by a number of small
contributions or a single large one. Situations where instead the main emphasis
is on the detection of large individual deviations do not possess the required
symmetry, and the test based on (7.19) need no longer be optimum.

The robustness properties against nonnormality of the $F$-test for testing equality
of means will be discussed using a large sample approach in Section 11.3, as
well as the corresponding test for equality of variances. Alternatively, permutation
tests will be applied in Section 15.2.

Instead of assuming $X_{ij}$ is normally distributed, suppose that $X_{ij}$ has distribution
$F(x - \mu_i)$, where $F$ is an arbitrary distribution with finite variance. If $F$
has heavy tails, the test (7.19) tends to be inefficient. More efficient tests can be
obtained by generalizing the considerations of Sections 6.8 and 6.9. Suppose the
$X_{ij}$ are samples of size $n_i$ from continuous distributions $F_i$ ($i = 1, \ldots, s$) and
that we wish to test $H : F_1 = \cdots = F_s$. Invariance, by the argument of Section
6.8, then reduces the data to the ranks $R_{ij}$ of the $X_{ij}$ in the combined sample
of $n = \sum n_i$ observations. A natural analogue of the two-sample Wilcoxon test
is the Kruskal-Wallis test, which rejects $H$ when $\sum n_i (R_{i\cdot} - R_{\cdot\cdot})^2$ is too large.
For the shift model $F_i(y) = F(y - \mu_i)$, the performance of this test relative to
(7.19) is similar to that of the Wilcoxon to the $t$-test in the case $s = 2$; the notion
of asymptotic relative efficiency will be developed in Section 13.2. The theory of
this and related rank tests is developed in books on nonparametric statistics such
as Randles and Wolfe (1979), Hettmansperger (1984), Gibbons and Chakraborti
(1992), Lehmann (1998) and Hajek, Sidak and Sen (1999).
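The rank sum of squares on which the Kruskal-Wallis test is based can be computed as follows. This is a minimal sketch with hypothetical data and names; it assumes that there are no ties (as is the case with probability 1 for continuous distributions) and returns the unnormalized quantity $\sum n_i (R_{i\cdot} - R_{\cdot\cdot})^2$, without a critical value.

```python
def kruskal_wallis_sum(groups):
    """Sum over groups of n_i * (mean rank of group i - overall mean rank)^2."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    if len(set(pooled)) != n:
        raise ValueError("this sketch assumes no ties among the observations")
    rank_of = {x: i + 1 for i, x in enumerate(pooled)}   # ranks 1, ..., n in the combined sample
    overall_mean_rank = (n + 1) / 2
    total = 0.0
    for g in groups:
        mean_rank = sum(rank_of[x] for x in g) / len(g)
        total += len(g) * (mean_rank - overall_mean_rank) ** 2
    return total

groups = [[1.1, 2.3, 0.7], [3.5, 4.1, 2.9], [5.0, 6.2]]   # hypothetical samples
kw = kruskal_wallis_sum(groups)
```

For these data the group mean ranks are 2, 5 and 7.5 and the overall mean rank is 4.5, so that `kw` equals 37.5.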

Unfortunately, such rank tests are available only for the simplest linear models.
An alternative approach capable of achieving similar efficiencies for much
wider classes of linear models can be obtained through large-sample theory, which
will be studied in Chapters 11-15. Briefly, the least-squares estimators may be
replaced by estimators with better efficiency properties for nonnormal distributions.
Furthermore, asymptotically valid significance levels can be obtained
through "Studentization",$^5$ that is, by dividing the statistic by a suitable estimator
of its standard deviation; see Section 11.3. Different ways of implementing
such a program are reviewed, for example, by Draper (1981, 1983), McKean and
Schrader (1982), Ronchetti (1982) and Hettmansperger, McKean and Sheather
(2000). [For a simple alternative of this kind to Student's $t$-test, see Prescott
(1975).]

$^4$ For conditions under which such a breakdown is possible, see Albert (1976).

$^5$ This term (after Student, the pseudonym of W. S. Gosset) is a misnomer. The
procedure of dividing the sample mean $\bar X$ by its estimated standard deviation and referring
the resulting statistic to the standard normal distribution (without regard to the
distribution of the $X$'s) was used already by Laplace. Student's contribution consisted of
pointing out that if the $X$'s are normal, the approximate normal distribution of the
$t$-statistic can be replaced by its exact distribution, Student's $t$.

Sometimes, it is of interest to test the hypothesis $H : \mu_1 = \cdots = \mu_s$ considered
at the beginning of the section, against only the ordered alternatives $\mu_1 \le \cdots \le \mu_s$
rather than against the general alternatives of any inequalities among the
$\mu_i$'s. Then the $F$-test (7.19) is no longer reasonable; more powerful alternative
tests for this and other problems involving ordered alternatives are discussed by
Robertson, Wright and Dykstra (1988). The problem of testing $H$ against one-sided
alternatives such as $K : \mu_i \ge 0$ for all $i$, with at least one inequality strict,
is treated by Perlman (1969) and in Barlow et al. (1972), which gives a survey of 
the literature; also see Tang (1994), Liu and Berger (1995) and Perlman and Wu 
(1999). Minimal complete classes and admissibility for this and related problems 
are discussed by Marden (1982a) and Cohen and Sackrowitz (1992). 


7.4 Two-Way Layout: One Observation per Cell 

The hypothesis of equality of several means arises when a number of different 
treatments, procedures, varieties, or manifestations of some other factors are to 
be compared. Frequently one is interested in studying the effects of more than one 
factor, or the effects of one factor as certain other conditions of the experiment 
vary, which then play the role of additional factors. In the present section we 
shall consider the case that the number of factors affecting the outcomes of the 
experiment is two. 

Suppose that one observation is obtained at each of a number of levels of these
factors, and denote by $X_{ij}$ ($i = 1, \ldots, a$; $j = 1, \ldots, b$) the value observed when
the first factor is at the $i$th and the second at the $j$th level. It is assumed that the
$X_{ij}$ are independently normally distributed with constant variance $\sigma^2$, and for
the moment also that the two factors act independently (they are then said to be
additive), so that $\xi_{ij}$ is of the form $\alpha_i' + \beta_j'$. Putting $\mu = \alpha'_\cdot + \beta'_\cdot$ and $\alpha_i = \alpha_i' - \alpha'_\cdot$,
$\beta_j = \beta_j' - \beta'_\cdot$, this can be written as
$$\xi_{ij} = \mu + \alpha_i + \beta_j, \qquad \sum \alpha_i = \sum \beta_j = 0, \tag{7.20}$$
where the $\alpha$'s and $\beta$'s (the main effects of $A$ and $B$) and $\mu$ are uniquely determined
by (7.20) as$^6$
$$\alpha_i = \xi_{i\cdot} - \xi_{\cdot\cdot}, \qquad \beta_j = \xi_{\cdot j} - \xi_{\cdot\cdot}, \qquad \mu = \xi_{\cdot\cdot}. \tag{7.21}$$
Consider the hypothesis
$$H : \alpha_1 = \cdots = \alpha_a = 0 \tag{7.22}$$
that the first factor has no effect on the outcome being observed. This arises in two
quite different contexts. The factor of interest, corresponding say to a number
of treatments, may be $\beta$, while $\alpha$ corresponds to a classification according to,
for example, the site on which the observations are obtained (farm, laboratory,
city, etc.). The hypothesis then represents the possibility that this subsidiary
classification has no effect on the experiment so that it need not be controlled.
Alternatively, $\alpha$ may be the (or a) factor of primary interest. In this case, the
formulation of the problem as one of hypothesis testing would usually be an
oversimplification, since in case of rejection of $H$, one would require estimates of
the $\alpha$'s or at least a grouping according to high and low values.

$^6$ The replacing of a subscript by a dot indicates that the variable has been averaged
with respect to that subscript.

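The hypothesis $H$ is a linear hypothesis with $r = a - 1$, $s = 1 + (a - 1) + (b - 1) =
a + b - 1$, and $n - s = (a - 1)(b - 1)$. The least-squares estimates of the parameters
under $\Omega$ can be obtained from the identity
$$\begin{aligned}
\sum\sum (X_{ij} - \xi_{ij})^2 &= \sum\sum (X_{ij} - \mu - \alpha_i - \beta_j)^2 \\
&= \sum\sum \left[ (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot}) + (X_{i\cdot} - X_{\cdot\cdot} - \alpha_i) + (X_{\cdot j} - X_{\cdot\cdot} - \beta_j) + (X_{\cdot\cdot} - \mu) \right]^2 \\
&= \sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2 + b \sum (X_{i\cdot} - X_{\cdot\cdot} - \alpha_i)^2 \\
&\quad + a \sum (X_{\cdot j} - X_{\cdot\cdot} - \beta_j)^2 + ab (X_{\cdot\cdot} - \mu)^2,
\end{aligned}$$
which is valid because in the expansion of the third sum of squares the cross-product
terms vanish. It follows that
$$\hat\alpha_i = X_{i\cdot} - X_{\cdot\cdot}, \qquad \hat\beta_j = X_{\cdot j} - X_{\cdot\cdot}, \qquad \hat\mu = X_{\cdot\cdot}, \tag{7.23}$$
and that
$$\sum\sum (X_{ij} - \hat\xi_{ij})^2 = \sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2.$$
Under the hypothesis $H$ we still have $\hat{\hat\beta}_j = X_{\cdot j} - X_{\cdot\cdot}$ and $\hat{\hat\mu} = X_{\cdot\cdot}$, and hence
$\hat\xi_{ij} - \hat{\hat\xi}_{ij} = X_{i\cdot} - X_{\cdot\cdot}$. The best invariant test therefore rejects when
$$W^* = \frac{b \sum (X_{i\cdot} - X_{\cdot\cdot})^2 / (a - 1)}{\sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2 / [(a - 1)(b - 1)]} > C. \tag{7.24}$$
The noncentrality parameter, on which the power of the test depends, is given
by
$$\psi^2 = \frac{b \sum (\xi_{i\cdot} - \xi_{\cdot\cdot})^2}{\sigma^2} = \frac{b \sum \alpha_i^2}{\sigma^2}. \tag{7.25}$$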


This problem provides another example of an analysis of variance. The total
variation can be broken into three components,
$$\sum\sum (X_{ij} - X_{\cdot\cdot})^2 = b \sum (X_{i\cdot} - X_{\cdot\cdot})^2 + a \sum (X_{\cdot j} - X_{\cdot\cdot})^2 + \sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2.$$
Of these, the first contains the variation due to the $\alpha$'s, the second that due to
the $\beta$'s. The last component, in the canonical form of Section 7.1, is equal to
$\sum_{i=s+1}^n Y_i^2$. It is therefore the sum of squares of those variables whose means are
zero even under $\Omega$. Since this residual part of the variation, which on division by
$n - s$ is an estimate of $\sigma^2$, cannot be attributed to any effects such as the $\alpha$'s or
$\beta$'s, it is frequently labeled "error," as an indication that it is due solely to the
randomness of the observations, not to any differences of the means. Actually,
the breakdown is not quite as sharp as is suggested by the above description. Any
component such as that attributed to the $\alpha$'s always also contains some "error,"
as is seen for example from its expectation, which is
$$E\left[ b \sum (X_{i\cdot} - X_{\cdot\cdot})^2 \right] = (a - 1)\sigma^2 + b \sum \alpha_i^2.$$
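The three-way breakdown and the statistic (7.24) can be computed from an $a \times b$ table as in the following sketch. The data and function names are hypothetical, and no critical value is computed.

```python
def two_way_anova(x):
    """Sums of squares and W* of (7.24) for an a x b table with one observation per cell."""
    a, b = len(x), len(x[0])
    row_means = [sum(row) / b for row in x]
    col_means = [sum(x[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(row_means) / a
    ss_a = b * sum((m - grand) ** 2 for m in row_means)
    ss_b = a * sum((m - grand) ** 2 for m in col_means)
    ss_error = sum(
        (x[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_total = sum((x[i][j] - grand) ** 2 for i in range(a) for j in range(b))
    w_star = (ss_a / (a - 1)) / (ss_error / ((a - 1) * (b - 1)))
    return {"ss_a": ss_a, "ss_b": ss_b, "ss_error": ss_error,
            "ss_total": ss_total, "w_star": w_star}

x = [[5.1, 6.0, 5.5, 6.4],
     [4.2, 5.3, 4.9, 5.6],
     [6.3, 6.9, 6.1, 7.4]]   # hypothetical 3 x 4 layout
result = two_way_anova(x)
```

The entry `ss_total` equals the sum of the other three sums of squares up to rounding. For the toy table `[[1, 2], [3, 5]]` the components are 6.25, 2.25 and 0.25, giving $W^* = 25$.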


Instead of testing whether a certain factor has any effect, one may wish to
estimate the size of the effect at the various levels of the factor. Other parameters
that are sometimes interesting to estimate are the average outcomes (for example
yields) $\xi_{1\cdot}, \ldots, \xi_{a\cdot}$ when the factor is at the various levels. If $\theta_i = \mu + \alpha_i = \xi_{i\cdot}$,
confidence sets for $(\theta_1, \ldots, \theta_a)$ are obtained by considering the hypotheses $H(\theta^0)$ :
$\theta_i = \theta_i^0$ ($i = 1, \ldots, a$). For testing $\theta_1 = \cdots = \theta_a = 0$, the least-squares estimates
of the $\xi_{ij}$ are $\hat\xi_{ij} = X_{i\cdot} + X_{\cdot j} - X_{\cdot\cdot}$ and $\hat{\hat\xi}_{ij} = X_{\cdot j} - X_{\cdot\cdot}$. The denominator sum of
squares is therefore $\sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2$ as before, while the numerator
sum of squares is
$$\sum\sum (\hat\xi_{ij} - \hat{\hat\xi}_{ij})^2 = b \sum X_{i\cdot}^2.$$
The general hypothesis reduces to this special case on replacing $X_{ij}$ with the
variable $X_{ij} - \theta_i^0$. Since $s = a + b - 1$ and $r = a$, the hypothesis $H(\theta^0)$ is rejected
when
$$\frac{b \sum (X_{i\cdot} - \theta_i^0)^2 / a}{\sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2 / [(a - 1)(b - 1)]} > C.$$
The associated confidence sets for $(\theta_1, \ldots, \theta_a)$ are the spheres
$$\sum (\theta_i - X_{i\cdot})^2 \le \frac{aC \sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2}{(a - 1)(b - 1)b}.$$

When considering confidence sets for the effects $\alpha_1, \ldots, \alpha_a$, one must take
account of the fact that the $\alpha$'s are not independent. Since they add up to zero,
it would be enough to restrict attention to $\alpha_1, \ldots, \alpha_{a-1}$. However, an easier and
more symmetric solution is found by retaining all the $\alpha$'s. The rejection region of
$H : \alpha_i = \alpha_i^0$ for $i = 1, \ldots, a$ (with $\sum \alpha_i^0 = 0$) is obtained from (7.24) by letting
$X_{ij}' = X_{ij} - \alpha_i^0$, and hence is given by
$$b \sum (X_{i\cdot} - X_{\cdot\cdot} - \alpha_i^0)^2 > \frac{C \sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2}{b - 1}.$$
The associated confidence set consists of the totality of points $(\alpha_1, \ldots, \alpha_a)$
satisfying $\sum \alpha_i = 0$ and
$$\sum [\alpha_i - (X_{i\cdot} - X_{\cdot\cdot})]^2 \le \frac{C \sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2}{b(b - 1)}.$$
In the space of $(\alpha_1, \ldots, \alpha_a)$, this inequality defines a sphere whose center
$(X_{1\cdot} - X_{\cdot\cdot}, \ldots, X_{a\cdot} - X_{\cdot\cdot})$ lies on the hyperplane $\sum \alpha_i = 0$. The confidence sets for the
$\alpha$'s therefore consist of the interior and surface of the great hyperspheres obtained
by cutting the $a$-dimensional spheres with the hyperplane $\sum \alpha_i = 0$.

In both this and the previous case, the usual method shows the class of confidence
sets to be invariant under the appropriate group of linear transformations,
and the sets are therefore uniformly most accurate invariant.
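Membership of a candidate vector $(\alpha_1, \ldots, \alpha_a)$ in the confidence set for the effects can be checked as in the following sketch. The function name, the data, and in particular the value passed for the cutoff `c` are hypothetical; in practice `c` would be the critical value $C$ of the $F$-distribution with $a - 1$ and $(a - 1)(b - 1)$ degrees of freedom.

```python
def alpha_confidence_set_contains(x, alpha_candidate, c, tol=1e-9):
    """True if alpha_candidate sums to zero and lies in the sphere around the estimated effects."""
    a, b = len(x), len(x[0])
    row_means = [sum(row) / b for row in x]
    col_means = [sum(x[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(row_means) / a
    ss_error = sum(
        (x[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    if abs(sum(alpha_candidate)) > tol:
        return False                      # not on the hyperplane sum alpha_i = 0
    dist2 = sum((al - (rm - grand)) ** 2 for al, rm in zip(alpha_candidate, row_means))
    radius2 = c * ss_error / (b * (b - 1))
    return dist2 <= radius2

x = [[1.0, 2.0], [3.0, 5.0]]   # hypothetical 2 x 2 layout
c = 4.0                        # placeholder cutoff, not an actual F quantile
center_inside = alpha_confidence_set_contains(x, [-1.25, 1.25], c)
origin_inside = alpha_confidence_set_contains(x, [0.0, 0.0], c)
off_plane_inside = alpha_confidence_set_contains(x, [-1.0, 1.2], c)
```

With these inputs the estimated effects are $(-1.25, 1.25)$ and the squared radius is 0.5, so the center is contained in the set, the origin is not, and the last candidate is excluded because its coordinates do not sum to zero.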

A rank test of (7.22) analogous to the Kruskal-Wallis test for the one-way
layout is Friedman's test, obtained by ranking the $s$ observations $X_{1j}, \ldots, X_{sj}$
separately from 1 to $s$ at each level $j$ of the second factor. If these ranks are
denoted by $R_{1j}, \ldots, R_{sj}$, Friedman's test rejects for large values of $\sum (R_{i\cdot} - R_{\cdot\cdot})^2$.
Unless $s$ is large, this test suffers from the fact that comparisons are restricted to
observations at the same level of factor 2. The test can be improved by "aligning"
the observations from different levels, for example, by subtracting from each
observation at the $j$th level its mean $X_{\cdot j}$ for that level, and then ranking the
aligned observations from 1 to $ab$. For a discussion of these tests and their
efficiency see Lehmann (1998, Chapter 6), and for an extension to tests of (7.22) in
the model (7.20) when there are several observations per cell, Mack and Skillings
(1980). Further discussion is provided by Hettmansperger (1984) and Gibbons
and Chakraborti (1992).
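The within-level ranking behind Friedman's test can be sketched as follows. The example is illustrative only: names and data are hypothetical, ties within a level are assumed absent, and the function returns the unnormalized sum $\sum (R_{i\cdot} - R_{\cdot\cdot})^2$ without a critical value.

```python
def friedman_sum(x):
    """Sum over i of (mean rank of row i - (s + 1) / 2)^2, ranking within each column."""
    s, b = len(x), len(x[0])
    rank_sums = [0.0] * s
    for j in range(b):
        column = [x[i][j] for i in range(s)]
        if len(set(column)) != s:
            raise ValueError("this sketch assumes no ties within a level")
        order = sorted(range(s), key=lambda i: column[i])
        for rank, i in enumerate(order, start=1):
            rank_sums[i] += rank          # rank of X_ij among X_1j, ..., X_sj
    overall_mean_rank = (s + 1) / 2
    return sum((rs / b - overall_mean_rank) ** 2 for rs in rank_sums)

x = [[1.0, 2.0, 1.5],
     [2.0, 3.0, 2.5],
     [3.0, 1.0, 3.5]]   # hypothetical layout: rows are levels of factor 1
fr = friedman_sum(x)
```

Here the row mean ranks are $4/3$, $7/3$ and $7/3$, the overall mean rank is 2, and `fr` equals $2/3$.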

That in the experiment described at the beginning of the section there is only
one observation per cell, and that as a consequence hypotheses about the $\alpha$'s
and $\beta$'s cannot be tested without some restrictions on the means $\xi_{ij}$, does not of
course justify the assumption of additivity. Rather, it is the other way around:
the experiment should not be performed with just one observation per cell unless
the factors can safely be assumed to be additive. Faced with such an experiment
without prior assurance that the assumption holds, one should test the hypothesis
of additivity. A number of tests for this purpose are discussed, for example, in
Hegemann and Johnson (1976) and Marasinghe and Johnson (1981).


7.5 Two-Way Layout: m Observations Per Cell 

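In the preceding section it was assumed that the effects of the two factors $\alpha$ and
$\beta$ are independent and hence additive. The factors may, however, interact in the
sense that the effect of one depends on the level of the other. Thus the effectiveness
of a teacher depends for example on the quality or the age of the students, and
the benefit derived by a crop from various amounts of irrigation depends on the
type of soil as well as on the variety being planted. If the additivity assumption
is dropped, the means $\xi_{ij}$ of $X_{ij}$ are no longer given by (7.20) under $\Omega$ but are
completely arbitrary. More than $ab$ observations, one for each combination of
levels, are then required, since otherwise $s = n$. We shall here consider only the
simple case in which the number of observations is the same at each combination
of levels.

Let $X_{ijk}$ ($i = 1, \ldots, a$; $j = 1, \ldots, b$; $k = 1, \ldots, m$) be independent normal with
common variance $\sigma^2$ and mean $E(X_{ijk}) = \xi_{ij}$. In analogy with the previous
notation we write
$$\begin{aligned}
\xi_{ij} &= \xi_{\cdot\cdot} + (\xi_{i\cdot} - \xi_{\cdot\cdot}) + (\xi_{\cdot j} - \xi_{\cdot\cdot}) + (\xi_{ij} - \xi_{i\cdot} - \xi_{\cdot j} + \xi_{\cdot\cdot}) \\
&= \mu + \alpha_i + \beta_j + \gamma_{ij}
\end{aligned}$$
with $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$. Then $\alpha_i$ is the average effect of
factor 1 at level $i$, averaged over the $b$ levels of factor 2, and a similar interpretation
holds for the $\beta$'s. The $\gamma$'s are called interactions, since $\gamma_{ij}$ measures the extent
to which the joint effect $\xi_{ij} - \xi_{\cdot\cdot}$ of factors 1 and 2 at levels $i$ and $j$ exceeds the
sum $(\xi_{i\cdot} - \xi_{\cdot\cdot}) + (\xi_{\cdot j} - \xi_{\cdot\cdot})$ of the individual effects. Consider again the hypothesis
that the $\alpha$'s are zero. Then $r = a - 1$, $s = ab$, and $n - s = (m - 1)ab$. From the
decomposition
$$\sum\sum\sum (X_{ijk} - \xi_{ij})^2 = \sum\sum\sum (X_{ijk} - X_{ij\cdot})^2 + m \sum\sum (X_{ij\cdot} - \xi_{ij})^2$$
and
$$\begin{aligned}
\sum\sum (X_{ij\cdot} - \xi_{ij})^2 &= \sum\sum (X_{ij\cdot} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot} - \gamma_{ij})^2 \\
&\quad + b \sum (X_{i\cdot\cdot} - X_{\cdot\cdot\cdot} - \alpha_i)^2 + a \sum (X_{\cdot j\cdot} - X_{\cdot\cdot\cdot} - \beta_j)^2 \\
&\quad + ab (X_{\cdot\cdot\cdot} - \mu)^2
\end{aligned}$$
it follows that
$$\begin{aligned}
\hat\mu &= \hat{\hat\mu} = \hat\xi_{\cdot\cdot} = X_{\cdot\cdot\cdot}, \qquad \hat\alpha_i = \hat\xi_{i\cdot} - \hat\xi_{\cdot\cdot} = X_{i\cdot\cdot} - X_{\cdot\cdot\cdot}, \\
\hat\beta_j &= \hat{\hat\beta}_j = \hat\xi_{\cdot j} - \hat\xi_{\cdot\cdot} = X_{\cdot j\cdot} - X_{\cdot\cdot\cdot}, \\
\hat\gamma_{ij} &= \hat{\hat\gamma}_{ij} = X_{ij\cdot} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot},
\end{aligned}$$
and hence that
$$\begin{aligned}
\sum\sum\sum (X_{ijk} - \hat\xi_{ij})^2 &= \sum\sum\sum (X_{ijk} - X_{ij\cdot})^2, \\
\sum\sum\sum (\hat\xi_{ij} - \hat{\hat\xi}_{ij})^2 &= mb \sum (X_{i\cdot\cdot} - X_{\cdot\cdot\cdot})^2.
\end{aligned}$$
The most powerful invariant test therefore rejects when
$$W^* = \frac{mb \sum (X_{i\cdot\cdot} - X_{\cdot\cdot\cdot})^2 / (a - 1)}{\sum\sum\sum (X_{ijk} - X_{ij\cdot})^2 / [(m - 1)ab]} > C, \tag{7.26}$$
and the noncentrality parameter in the distribution of $W^*$ is
$$\psi^2 = \frac{mb \sum (\xi_{i\cdot} - \xi_{\cdot\cdot})^2}{\sigma^2} = \frac{mb \sum \alpha_i^2}{\sigma^2}. \tag{7.27}$$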


Another hypothesis of interest is the hypothesis $H'$ that the two factors are
additive,$^7$
$$H' : \gamma_{ij} = 0 \quad \text{for all } i, j.$$
The least-squares estimates of the parameters are easily derived as before, and
the UMP invariant test is seen to have the rejection region (Problem 7.13)
$$W^* = \frac{m \sum\sum (X_{ij\cdot} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot})^2 / [(a - 1)(b - 1)]}{\sum\sum\sum (X_{ijk} - X_{ij\cdot})^2 / [(m - 1)ab]} > C. \tag{7.28}$$

$^7$ A test of $H'$ against certain restricted alternatives has been proposed for the case
of one observation per cell by Tukey (1949a); see Hegemann and Johnson (1976) for
further discussion.

Under $H'$, the statistic $W^*$ has the $F$-distribution with $(a - 1)(b - 1)$ and $(m - 1)ab$
degrees of freedom; the noncentrality parameter for any alternative set of $\gamma$'s is
$$\psi^2 = \frac{m \sum\sum \gamma_{ij}^2}{\sigma^2}. \tag{7.29}$$
The decomposition of the total variation into its various components, in the
present case, is given by
$$\begin{aligned}
\sum\sum\sum (X_{ijk} - X_{\cdot\cdot\cdot})^2 &= mb \sum (X_{i\cdot\cdot} - X_{\cdot\cdot\cdot})^2 + ma \sum (X_{\cdot j\cdot} - X_{\cdot\cdot\cdot})^2 \\
&\quad + m \sum\sum (X_{ij\cdot} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot})^2 \\
&\quad + \sum\sum\sum (X_{ijk} - X_{ij\cdot})^2.
\end{aligned}$$
Here the first three terms contain the variation due to the $\alpha$'s, $\beta$'s and $\gamma$'s
respectively, and the last component corresponds to error. The tests for the hypotheses
that the $\alpha$'s, $\beta$'s, or $\gamma$'s are zero, the first and third of which have the rejection
regions (7.26) and (7.28), are then obtained by comparing the $\alpha$, $\beta$, or $\gamma$ sum of
squares with that for error.
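The four-component decomposition and the statistics (7.26) and (7.28) can be computed from an $a \times b \times m$ array as in the following sketch. Data and names are hypothetical, and critical values are not computed.

```python
def two_way_replicated_anova(x):
    """Sums of squares and the W* statistics of (7.26) and (7.28); x[i][j] lists the m cell values."""
    a, b, m = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[i][j]) / m for j in range(b)] for i in range(a)]
    row = [sum(cell[i]) / b for i in range(a)]
    col = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(row) / a
    ss_a = m * b * sum((r - grand) ** 2 for r in row)
    ss_b = m * a * sum((c - grand) ** 2 for c in col)
    ss_ab = m * sum(
        (cell[i][j] - row[i] - col[j] + grand) ** 2 for i in range(a) for j in range(b)
    )
    ss_error = sum(
        (v - cell[i][j]) ** 2 for i in range(a) for j in range(b) for v in x[i][j]
    )
    ss_total = sum((v - grand) ** 2 for i in range(a) for j in range(b) for v in x[i][j])
    error_ms = ss_error / ((m - 1) * a * b)
    return {
        "ss_a": ss_a, "ss_b": ss_b, "ss_ab": ss_ab,
        "ss_error": ss_error, "ss_total": ss_total,
        "w_star_alpha": (ss_a / (a - 1)) / error_ms,               # (7.26)
        "w_star_gamma": (ss_ab / ((a - 1) * (b - 1))) / error_ms,  # (7.28)
    }

x = [[[1.0, 3.0], [2.0, 4.0]],
     [[5.0, 7.0], [8.0, 12.0]]]   # hypothetical 2 x 2 layout with m = 2
result = two_way_replicated_anova(x)
```

For these data the four components are 60.5, 12.5, 4.5 and 14, which add up to the total variation 91.5, and the statistic (7.26) equals $60.5/3.5$.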

An analogous decomposition is possible when the $\gamma$'s are assumed a priori to be
equal to zero. In that case, the third component which previously was associated
with $\gamma$ represents an additional contribution to error, and the breakdown becomes
$$\sum\sum\sum (X_{ijk} - X_{\cdot\cdot\cdot})^2 = mb \sum (X_{i\cdot\cdot} - X_{\cdot\cdot\cdot})^2 + ma \sum (X_{\cdot j\cdot} - X_{\cdot\cdot\cdot})^2 + \sum\sum\sum (X_{ijk} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot})^2,$$
with the last term corresponding to error. The hypothesis $H : \alpha_1 = \cdots = \alpha_a = 0$
is then rejected when
$$\frac{mb \sum (X_{i\cdot\cdot} - X_{\cdot\cdot\cdot})^2 / (a - 1)}{\sum\sum\sum (X_{ijk} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot})^2 / (abm - a - b + 1)} > C.$$
Suppose now that the assumption of no interaction, under which this test was
derived, is not justified. The denominator sum of squares then has a noncentral
$\chi^2$-distribution instead of a central one, and is therefore stochastically larger than
was assumed (Problem 7.15). It follows that the actual rejection probability is
less than it would be for $\sum\sum \gamma_{ij}^2 = 0$. This shows that the probability of an error
of the first kind will not exceed the nominal level of significance, regardless of the
values of the $\gamma$'s. However, the power also decreases with increasing $\sum\sum \gamma_{ij}^2 / \sigma^2$
and tends to zero as this ratio tends to infinity.

The analysis of variance and the associated tests derived in this section for
two factors extend in a straightforward manner to a larger number of factors (see
for example Problem 7.16). On the other hand, if the number of observations is
not the same for each combination of levels (each cell), explicit formulae for the
least-squares estimators may no longer be available, but there is no difficulty in
computing these estimators and the associated UMP invariant tests numerically.
However, in applications it is then not always clear how to define main effects,
interactions, and other parameters of interest, and hence what hypothesis to test.
These issues are discussed, for example, in Hocking and Speed (1975) and Speed,
Hocking, and Hackney (1979). See also TPE2, Chapter 3, Example 4.9, Arnold
(1981, Section 7.4), Searle (1987), McCulloch and Searle (2001) and Hocking
(2003).

Of great importance are arrangements in which only certain combinations of 
levels occur, since they permit reducing the size of the experiment. Thus for 
example three independent factors, at m levels each, can be analyzed with only 
m 2 observations, instead of the m 3 required if 1 observation were taken at each 
combination of levels, by adopting a Latin-square design (Problem 7.17). 

The class of problems considered here contains as a special case the two-sample 
problem treated in Chapter 5, which concerns a single factor with only two levels. 
The questions discussed in that connection regarding possible inhomogeneities of 
the experimental material and the randomization required to offset it are of equal 
importance in the present, more complex situations. If inhomogeneous material 
is subdivided into more homogeneous groups, this classification can be treated 
as constituting one or more additional factors. The choice of these groups is an 
important aspect in the determination of a suitable experimental design.⁸ A very
simple example of this is discussed in Problems 5.49 and 5.50. 

Multiple comparison procedures for two-way (and higher) layouts are discussed 
by Spjøtvoll (1974); additional references can be obtained from Miller (1977b,
1986) and Westfall and Young (1993). The more general problem of multiple 
testing will be treated in Chapter 9. 


7.6 Regression 

Hypotheses specifying one or both of the regression coefficients $\alpha$, $\beta$ when
$X_1, \ldots, X_n$ are independently normally distributed with common variance $\sigma^2$
and means

$$\xi_i = \alpha + \beta t_i \tag{7.30}$$

are essentially linear hypotheses, as was pointed out in Example 7.1.2. The hypotheses
$H_1 : \alpha = \alpha_0$ and $H_2 : \beta = \beta_0$ were treated in Section 5.6, where they
were shown to possess UMP unbiased tests. We shall now consider $H_1$ and $H_2$,
as well as the hypothesis $H_3 : \alpha = \alpha_0$, $\beta = \beta_0$, from the present point of view.
By the general theory of Section 7.1, the resulting tests will be UMP invariant
under suitable groups of linear transformations. For the first two cases, in which
$r = 1$, this also provides, by the argument of Section 6.6, an alternative proof of
their being UMP unbiased.

The space $\Pi_\Omega$ is the same for all three hypotheses. It is spanned by the vectors
$(1, \ldots, 1)$ and $(t_1, \ldots, t_n)$ and therefore has dimension $s = 2$ unless the $t_i$ are all


⁸ For a discussion of various designs and the conditions under which they are appropriate
see, for example, Box, Hunter, and Hunter (1978), Montgomery (2001) and Wu
and Hamada (2000). Optimum properties of certain designs, proved by Wald, Ehrenfeld,
Kiefer, and others, are discussed by Kiefer (1958), Silvey (1980), Atkinson and
Donev (1992) and Pukelsheim (1993). The role of randomization, treated for the two-sample
problem in Section 5.10, is studied by Kempthorne (1955), Wilk and Kempthorne
(1955), Scheffé (1959), and others; see, for example, Lorenzen (1984) and Giesbrecht and
Gumpertz (2004).

equal, which we shall assume not to be the case. The least-squares estimates $\hat\alpha$
and $\hat\beta$ under $\Omega$ are obtained by minimizing $\sum (X_i - \alpha - \beta t_i)^2$. For any fixed value
of $\beta$, this is achieved by the value $\alpha = \bar X - \beta \bar t$, for which the sum of squares
reduces to $\sum [(X_i - \bar X) - \beta(t_i - \bar t)]^2$. By minimizing this with respect to $\beta$ one
finds

$$\hat\beta = \frac{\sum (X_i - \bar X)(t_i - \bar t)}{\sum (t_j - \bar t)^2}, \qquad \hat\alpha = \bar X - \hat\beta\,\bar t; \tag{7.31}$$

and

$$\sum (X_i - \hat\alpha - \hat\beta t_i)^2 = \sum (X_i - \bar X)^2 - \hat\beta^2 \sum (t_i - \bar t)^2$$

is the denominator sum of squares for all three hypotheses. The numerator of the
test statistic (7.7) for testing the two hypotheses $\alpha = 0$ and $\beta = 0$ is $Y_1^2$, and
for testing $\alpha = \beta = 0$ is $Y_1^2 + Y_2^2$.

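For the hypothesis $\alpha = 0$, the statistic $Y_1$ was shown in Example 7.1.3 to be
equal to

$$\left(\bar X - \bar t\,\frac{\sum t_j X_j}{\sum t_j^2}\right) \sqrt{\frac{n \sum t_j^2}{\sum (t_j - \bar t)^2}} = \hat\alpha \sqrt{\frac{n \sum (t_j - \bar t)^2}{\sum t_j^2}}.$$

Since then

$$E(Y_1) = \alpha \sqrt{\frac{n \sum (t_j - \bar t)^2}{\sum t_j^2}},$$

the hypothesis $\alpha = \alpha_0$ is equivalent to the hypothesis

$$E(Y_1) = \eta_1^0 = \alpha_0 \sqrt{n \sum (t_j - \bar t)^2 \Big/ \sum t_j^2},$$

for which the rejection region (7.17) is

$$\frac{(n - s)(Y_1 - \eta_1^0)^2}{\sum_{i=s+1}^n Y_i^2} > C_0$$

and hence

$$\frac{|\hat\alpha - \alpha_0| \sqrt{n \sum (t_j - \bar t)^2 \big/ \sum t_j^2}}{\sqrt{\sum (X_i - \hat\alpha - \hat\beta t_i)^2 / (n - 2)}} > C_0. \tag{7.32}$$

For the hypothesis $\beta = 0$, $Y_1$ was shown to be equal to

$$\frac{\sum (X_i - \bar X)(t_i - \bar t)}{\sqrt{\sum (t_j - \bar t)^2}} = \hat\beta \sqrt{\sum (t_j - \bar t)^2}.$$

Since then $E(Y_1) = \beta \sqrt{\sum (t_j - \bar t)^2}$, the hypothesis $\beta = \beta_0$ is equivalent to
$E(Y_1) = \eta_1^0 = \beta_0 \sqrt{\sum (t_j - \bar t)^2}$ and the rejection region is

$$\frac{|\hat\beta - \beta_0| \sqrt{\sum (t_j - \bar t)^2}}{\sqrt{\sum (X_i - \hat\alpha - \hat\beta t_i)^2 / (n - 2)}} > C_0. \tag{7.33}$$

For testing $\alpha = \beta = 0$, it was shown in Example 7.1.3 that

$$Y_1 = \hat\beta \sqrt{\sum (t_j - \bar t)^2}, \qquad Y_2 = \sqrt{n}\,\bar X = \sqrt{n}\,(\hat\alpha + \hat\beta\,\bar t);$$

the numerator of (7.7) is therefore

$$\frac{Y_1^2 + Y_2^2}{2} = \frac{n(\hat\alpha + \hat\beta\,\bar t)^2 + \hat\beta^2 \sum (t_j - \bar t)^2}{2}.$$

The more general hypothesis $\alpha = \alpha_0$, $\beta = \beta_0$ is equivalent to $E(Y_1) = \eta_1^0$,
$E(Y_2) = \eta_2^0$, where $\eta_1^0 = \beta_0 \sqrt{\sum (t_j - \bar t)^2}$, $\eta_2^0 = \sqrt{n}\,(\alpha_0 + \beta_0 \bar t)$; and the rejection
region (7.17) can therefore be written as

$$\frac{\left[n(\hat\alpha - \alpha_0)^2 + 2n\bar t\,(\hat\alpha - \alpha_0)(\hat\beta - \beta_0) + \sum t_i^2\,(\hat\beta - \beta_0)^2\right] / 2}{\sum (X_i - \hat\alpha - \hat\beta t_i)^2 / (n - 2)} > C. \tag{7.34}$$

The associated confidence sets for $(\alpha, \beta)$ are obtained by reversing this inequality
and replacing $\alpha_0$ and $\beta_0$ by $\alpha$ and $\beta$. The resulting sets are ellipses centered at
$(\hat\alpha, \hat\beta)$.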
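
As a numerical illustration of (7.31) and of the statistic on the left-hand side of (7.33), the following minimal sketch fits the line by least squares and evaluates the statistic for a hypothesized slope. The function name, the argument `beta0`, and the data are illustrative assumptions, not part of the text; the critical value $C_0$ is not computed here.

```python
from math import sqrt

def simple_regression_stats(t, x, beta0=0.0):
    """Least-squares fit of x_i = alpha + beta * t_i and the statistic of (7.33).

    Hypothetical helper for illustration; beta0 is the hypothesized slope.
    """
    n = len(t)
    t_bar = sum(t) / n
    x_bar = sum(x) / n
    s_tt = sum((ti - t_bar) ** 2 for ti in t)                      # sum (t_j - t_bar)^2
    s_tx = sum((xi - x_bar) * (ti - t_bar) for ti, xi in zip(t, x))
    beta_hat = s_tx / s_tt                                         # slope estimate, (7.31)
    alpha_hat = x_bar - beta_hat * t_bar                           # intercept estimate, (7.31)
    # Denominator sum of squares, computed directly from the residuals.
    rss = sum((xi - alpha_hat - beta_hat * ti) ** 2 for ti, xi in zip(t, x))
    # Left-hand side of the rejection region (7.33).
    stat = abs(beta_hat - beta0) * sqrt(s_tt) / sqrt(rss / (n - 2))
    return alpha_hat, beta_hat, rss, stat

# Made-up data lying close to the line x = 1 + 2 t.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [1.1, 2.9, 5.2, 6.8, 9.0]
alpha_hat, beta_hat, rss, stat = simple_regression_stats(t, x, beta0=2.0)
```

The residual sum of squares is computed directly rather than through the identity following (7.31); the two agree up to rounding, which gives a simple check of the algebra.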

The simple regression model (7.30) can be generalized in many directions; the
means may for example be polynomials in $t_i$ of higher than the first degree (see
Problem 7.20), or more complex functions such as trigonometric polynomials; or
they may be functions of several variables, $t_i$, $u_i$, $v_i$. Some further extensions will
now be illustrated by a number of examples.


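Example 7.6.1 A variety of problems arise when there is more than one
regression line. Suppose that the variables $X_{ij}$ are independently normally
distributed with common variance and means

$$\xi_{ij} = \alpha_i + \beta_i t_{ij} \qquad (j = 1, \ldots, n_i;\ i = 1, \ldots, b). \tag{7.35}$$

The hypothesis that these regression lines have equal slopes

$$H : \beta_1 = \cdots = \beta_b$$

may occur for example when the equality of a number of growth rates is to be
tested. The parameter space $\Pi_\Omega$ has dimension $s = 2b$ provided none of the
sums $\sum_j (t_{ij} - t_{i\cdot})^2$ is zero; the number of constraints imposed by the hypothesis
is $r = b - 1$. The minimum value of $\sum\sum (X_{ij} - \xi_{ij})^2$ under $\Omega$ is obtained by
minimizing $\sum_j (X_{ij} - \alpha_i - \beta_i t_{ij})^2$ for each $i$, so that by (7.31),

$$\hat\beta_i = \frac{\sum_j (X_{ij} - X_{i\cdot})(t_{ij} - t_{i\cdot})}{\sum_j (t_{ij} - t_{i\cdot})^2}, \qquad \hat\alpha_i = X_{i\cdot} - \hat\beta_i t_{i\cdot}.$$

Under $H$, one must minimize $\sum\sum (X_{ij} - \alpha_i - \beta t_{ij})^2$, which for any fixed $\beta$ leads
to $\alpha_i = X_{i\cdot} - \beta t_{i\cdot}$ and reduces the sum of squares to $\sum\sum [(X_{ij} - X_{i\cdot}) - \beta(t_{ij} - t_{i\cdot})]^2$.
Minimizing this with respect to $\beta$, one finds

$$\hat{\hat\beta} = \frac{\sum\sum (X_{ij} - X_{i\cdot})(t_{ij} - t_{i\cdot})}{\sum\sum (t_{ij} - t_{i\cdot})^2}, \qquad \hat{\hat\alpha}_i = X_{i\cdot} - \hat{\hat\beta}\, t_{i\cdot}.$$

Since

$$X_{ij} - \hat\xi_{ij} = X_{ij} - \hat\alpha_i - \hat\beta_i t_{ij} = (X_{ij} - X_{i\cdot}) - \hat\beta_i (t_{ij} - t_{i\cdot})$$

and

$$\hat\xi_{ij} - \hat{\hat\xi}_{ij} = (\hat\alpha_i - \hat{\hat\alpha}_i) + t_{ij}(\hat\beta_i - \hat{\hat\beta}) = (\hat\beta_i - \hat{\hat\beta})(t_{ij} - t_{i\cdot}),$$

the rejection region (7.15) is

$$\frac{\sum_i (\hat\beta_i - \hat{\hat\beta})^2 \sum_j (t_{ij} - t_{i\cdot})^2 / (b - 1)}{\sum\sum [(X_{ij} - X_{i\cdot}) - \hat\beta_i (t_{ij} - t_{i\cdot})]^2 / (n - 2b)} > C, \tag{7.36}$$

where the left-hand side under $H$ has the $F$-distribution with $b - 1$ and $n - 2b$
degrees of freedom.

Since

$$E(\hat\beta_i) = \beta_i \quad \text{and} \quad E(\hat{\hat\beta}) = \frac{\sum_i \beta_i \sum_j (t_{ij} - t_{i\cdot})^2}{\sum\sum (t_{ij} - t_{i\cdot})^2},$$

the noncentrality parameter of the distribution for an alternative set of $\beta$'s is
$\psi^2 = \sum_i (\beta_i - \tilde\beta)^2 \sum_j (t_{ij} - t_{i\cdot})^2 / \sigma^2$, where $\tilde\beta = E(\hat{\hat\beta})$. In the particular case that
the $n_i$ and the $t_{ij}$ are independent of $i$, $\tilde\beta$ reduces to $\bar\beta = \sum \beta_i / b$. ■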
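
The computation behind (7.36) can be sketched as follows: fit each line separately, pool the within-line cross products to obtain the common slope, and form the ratio of mean squares. This is only an illustrative sketch; the function name, the data layout (a list of `(t_values, x_values)` pairs), and the numbers are assumptions, and the critical value $C$ is left to the reader.

```python
def equal_slopes_statistic(groups):
    """Left-hand side of (7.36) for b regression lines.

    groups: list of (t_values, x_values) pairs, one per line (hypothetical layout).
    """
    b = len(groups)
    n = sum(len(t) for t, _ in groups)          # total number of observations
    s_tt, s_tx, rss = [], [], 0.0
    for t, x in groups:
        t_bar = sum(t) / len(t)
        x_bar = sum(x) / len(x)
        stt = sum((ti - t_bar) ** 2 for ti in t)
        stx = sum((xi - x_bar) * (ti - t_bar) for ti, xi in zip(t, x))
        beta_i = stx / stt                       # separate slope for this line
        # Residual sum of squares under the full model (separate lines).
        rss += sum(((xi - x_bar) - beta_i * (ti - t_bar)) ** 2
                   for ti, xi in zip(t, x))
        s_tt.append(stt)
        s_tx.append(stx)
    beta_common = sum(s_tx) / sum(s_tt)          # pooled slope under H
    numerator = sum((stx / stt - beta_common) ** 2 * stt
                    for stx, stt in zip(s_tx, s_tt))
    return (numerator / (b - 1)) / (rss / (n - 2 * b))

# Two made-up lines with fitted slopes 0.9 and 2.9.
groups = [([0.0, 1.0, 2.0], [0.0, 1.2, 1.8]),
          ([0.0, 1.0, 2.0], [1.0, 4.2, 6.8])]
f_stat = equal_slopes_statistic(groups)
```

With these numbers the pooled slope is 1.9, the numerator sum of squares is 4, and the residual sum of squares is 0.12 on $n - 2b = 2$ degrees of freedom.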


Example 7.6.2 The regression model (7.35) arises in the comparison of a number
of treatments when the experimental units are treated as fixed and the unit
effects $u_{ij}$ (defined in Section 5.9) are proportional to known constants $t_{ij}$. Here
$t_{ij}$ might for example be a measure of the fertility of the $i,j$th piece of land or
the weight of the $i,j$th experimental animal prior to the experiment. It is then
frequently possible to assume that the proportionality factor $\beta_i$ does not depend
on the treatment, in which case (7.35) reduces to

$$\xi_{ij} = \alpha_i + \beta t_{ij} \tag{7.37}$$

and the hypothesis of no treatment effect becomes

$$H : \alpha_1 = \cdots = \alpha_b.$$

The space $\Pi_\Omega$ coincides with $\Pi_\omega$ of the previous example, so that $s = b + 1$
and

$$\hat\beta = \frac{\sum\sum (X_{ij} - X_{i\cdot})(t_{ij} - t_{i\cdot})}{\sum\sum (t_{ij} - t_{i\cdot})^2}, \qquad \hat\alpha_i = X_{i\cdot} - \hat\beta\, t_{i\cdot}.$$

Minimization of $\sum\sum (X_{ij} - \alpha - \beta t_{ij})^2$ gives

$$\hat{\hat\beta} = \frac{\sum\sum (X_{ij} - X_{\cdot\cdot})(t_{ij} - t_{\cdot\cdot})}{\sum\sum (t_{ij} - t_{\cdot\cdot})^2}, \qquad \hat{\hat\alpha} = X_{\cdot\cdot} - \hat{\hat\beta}\, t_{\cdot\cdot},$$

where $X_{\cdot\cdot} = \sum\sum X_{ij}/n$, $t_{\cdot\cdot} = \sum\sum t_{ij}/n$, $n = \sum n_i$. The sum of squares in the
numerator of $W^*$ in (7.15) is thus

$$\sum\sum (\hat\xi_{ij} - \hat{\hat\xi}_{ij})^2 = \sum\sum \left[(X_{i\cdot} - X_{\cdot\cdot}) + \hat\beta (t_{ij} - t_{i\cdot}) - \hat{\hat\beta}(t_{ij} - t_{\cdot\cdot})\right]^2.$$

The hypothesis $H$ is therefore rejected when

$$\frac{\sum\sum \left[(X_{i\cdot} - X_{\cdot\cdot}) + \hat\beta (t_{ij} - t_{i\cdot}) - \hat{\hat\beta}(t_{ij} - t_{\cdot\cdot})\right]^2 / (b - 1)}{\sum\sum \left[(X_{ij} - X_{i\cdot}) - \hat\beta (t_{ij} - t_{i\cdot})\right]^2 / (n - b - 1)} > C, \tag{7.38}$$

where under $H$ the left-hand side has the $F$-distribution with $b - 1$ and $n - b - 1$
degrees of freedom.

The hypothesis $H$ can be tested without first ascertaining the values of the
$t_{ij}$; it is then the hypothesis of no effect in a one-way classification considered in
Section 7.3, and the test is given by (7.19). Actually, since the unit effects $u_{ij}$
are assumed to be constants, which are now completely unknown, the treatments
are assigned to the units either completely at random or at random within subgroups.
The appropriate test is then a randomization test for which (7.19) is an
approximation. ■

Example 7.6.2 illustrates the important class of situations in which an analysis 
of variance (in the present case concerning a one-way classification) is combined 
with a regression problem (in the present case linear regression on the single
“concomitant variable” $t$). Both parts of the problem may of course be considerably
more complex than was assumed here. Quite generally, in such combined
problems one can test (or estimate) the treatment effects as was done above, and 
a similar analysis can be given for the regression coefficients. The breakdown of 
the variation into its various treatment and regression components is the so-called 
analysis of covariance. 


7.7 Random-Effects Model: One-way Classification 

In the factorial experiments discussed in Sections 7.3, 7.4, and 7.5, the factor
levels were considered fixed, and the associated effects (the $\mu$'s in Section 7.3,
the $\alpha$'s, $\beta$'s and $\gamma$'s in Sections 7.4 and 7.5) to be unknown constants. However,
in many applications, these levels and their effects instead are (unobservable)
random variables. If all the effects are constant or all random, one speaks of a
fixed-effects model (model I) or a random-effects model (model II) respectively,
and the term mixed model refers to situations in which both types occur.⁹ Of
course, only the model I case constitutes a linear hypothesis according to the
definition given at the beginning of the chapter. In the present section we shall
treat as model II the case of a single factor (one-way classification), which was
analyzed under the model I assumption in Section 7.3.

As an illustration of this problem, consider a material such as steel, which is
manufactured or processed in batches. Suppose that a sample of size $n$ is taken
from each of $s$ batches and that the resulting measurements $X_{ij}$ ($j = 1, \ldots, n$;
$i = 1, \ldots, s$) are independently normally distributed with variance $\sigma^2$ and mean
$\xi_i$. If the factor corresponding to $i$ were constant, with the same effect $\alpha_i$ in each
replication of the experiment, we would have

$$\xi_i = \mu + \alpha_i \qquad \left(\sum \alpha_i = 0\right)$$

and

$$X_{ij} = \mu + \alpha_i + U_{ij},$$

where the $U_{ij}$ are independently distributed as $N(0, \sigma^2)$. The hypothesis of no
effect is $\xi_1 = \cdots = \xi_s$, or equivalently $\alpha_1 = \cdots = \alpha_s = 0$. However, the effect is


⁹ For a recent exposition of random effects models, see Sahai and Ojeda (2004).
associated with the batches, of which a new set will be involved in each replication
of the experiment; the effect therefore does not remain constant. Instead, we shall
suppose that the batch effects constitute a sample from a normal distribution,
and to indicate their random nature we shall write $A_i$ for $\alpha_i$, so that

$$X_{ij} = \mu + A_i + U_{ij}. \tag{7.39}$$

The assumption of additivity (lack of interaction) of batch and unit effect, in the
present model, implies that the $A$'s and $U$'s are independent. If the expectation of
$A_i$ is absorbed into $\mu$, it follows that the $A$'s and $U$'s are independently normally
distributed with zero means and variances $\sigma_A^2$ and $\sigma^2$ respectively. The $X$'s of
course are no longer independent.

The hypothesis of no batch effect, that the $A$'s are zero and hence constant,
takes the form

$$H : \sigma_A^2 = 0.$$

This is not realistic in the present situation, but is the limiting case of the
hypothesis

$$H(\Delta_0) : \frac{\sigma_A^2}{\sigma^2} \le \Delta_0$$

that the batch effect is small relative to the variation of the material within a
batch. These two hypotheses correspond respectively to the model I hypotheses
$\sum \alpha_i^2 = 0$ and $\sum \alpha_i^2 / \sigma^2 \le \Delta_0$.

To obtain a test of $H(\Delta_0)$ it is convenient to begin with the same transformation
of variables that reduced the corresponding model I problem to canonical
form. Each set $(X_{i1}, \ldots, X_{in})$ is subjected to an orthogonal transformation $Y_{ij} =
\sum_{k=1}^n c_{jk} X_{ik}$ such that $Y_{i1} = \sqrt{n}\, X_{i\cdot}$. Since $c_{1k} = 1/\sqrt{n}$ for $k = 1, \ldots, n$ (see Example
7.1.3), it follows from the assumption of orthogonality that $\sum_{k=1}^n c_{jk} = 0$
for $j = 2, \ldots, n$ and hence that $Y_{ij} = \sum_{k=1}^n c_{jk} U_{ik}$ for $j > 1$. The $Y_{ij}$ with $j > 1$
are therefore independently normally distributed with zero mean and variance $\sigma^2$.
They are also independent of $U_{i\cdot}$ since $(\sqrt{n}\, U_{i\cdot}\ Y_{i2} \cdots Y_{in})' = C(U_{i1} U_{i2} \ldots U_{in})'$
(a prime indicates the transpose of a matrix). On the other hand, the variables
$Y_{i1} = \sqrt{n}\, X_{i\cdot} = \sqrt{n}\,(\mu + A_i + U_{i\cdot})$ are also independently normally distributed
but with mean $\sqrt{n}\,\mu$ and variance $\sigma^2 + n\sigma_A^2$. If an additional orthogonal transformation
is made from $(Y_{11}, \ldots, Y_{s1})$ to $(Z_{11}, \ldots, Z_{s1})$ such that $Z_{11} = \sqrt{s}\, Y_{\cdot 1}$, the
$Z$'s are independently normally distributed with common variance $\sigma^2 + n\sigma_A^2$ and
means $E(Z_{11}) = \sqrt{sn}\,\mu$ and $E(Z_{i1}) = 0$ for $i > 1$. Putting $Z_{ij} = Y_{ij}$ for $j > 1$ for
the sake of conformity, the joint density of the $Z$'s is then

$$(2\pi)^{-ns/2}\, \sigma^{-(n-1)s}\, (\sigma^2 + n\sigma_A^2)^{-s/2} \exp\left[-\frac{1}{2(\sigma^2 + n\sigma_A^2)}\left((z_{11} - \sqrt{sn}\,\mu)^2 + \sum_{i=2}^s z_{i1}^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^s \sum_{j=2}^n z_{ij}^2\right]. \tag{7.40}$$

The problem of testing $H(\Delta_0)$ is invariant under addition of an arbitrary constant
to $Z_{11}$, which leaves the remaining $Z$'s as a maximal set of invariants. These
constitute samples of size $s(n - 1)$ and $s - 1$ from two normal distributions with
means zero and variances $\sigma^2$ and $\tau^2 = \sigma^2 + n\sigma_A^2$.

The hypothesis $H(\Delta_0)$ is equivalent to $\tau^2/\sigma^2 \le 1 + \Delta_0 n$, and the problem
reduces to that of comparing two normal variances, which was considered in
Example 6.3.4 without the restriction to zero means. The UMP invariant test,
under multiplication of all $Z_{ij}$ by a common positive constant, has the rejection
region

$$W^* = \frac{1}{1 + \Delta_0 n} \cdot \frac{S_A^2/(s - 1)}{S^2/((n - 1)s)} > C, \tag{7.41}$$

where

$$S_A^2 = \sum_{i=2}^s Z_{i1}^2 \quad \text{and} \quad S^2 = \sum_{i=1}^s \sum_{j=2}^n Z_{ij}^2 = \sum_{i=1}^s \sum_{j=2}^n Y_{ij}^2.$$

The constant $C$ is determined by

$$\int_C^\infty F_{s-1,(n-1)s}(y)\, dy = \alpha.$$

Since

$$\sum_{j=1}^n Y_{ij}^2 - Y_{i1}^2 = \sum_{j=1}^n U_{ij}^2 - n U_{i\cdot}^2$$

and

$$\sum_{i=1}^s Z_{i1}^2 - Z_{11}^2 = \sum_{i=1}^s Y_{i1}^2 - s Y_{\cdot 1}^2,$$

the numerator and denominator sums of squares of $W^*$, expressed in terms of
the $X$'s, become

$$S_A^2 = n \sum_{i=1}^s (X_{i\cdot} - X_{\cdot\cdot})^2 \quad \text{and} \quad S^2 = \sum_{i=1}^s \sum_{j=1}^n (X_{ij} - X_{i\cdot})^2.$$

In the particular case $\Delta_0 = 0$, the test (7.41) is equivalent to the corresponding
model I test (7.19), but they are of course solutions of different problems, and
also have different power functions. Instead of being distributed according to a
noncentral $\chi^2$-distribution as in model I, the numerator sum of squares of $W^*$ is
proportional to a central $\chi^2$-variable even when the hypothesis is false, and the
power of the test (7.41) against an alternative value of $\Delta$ is obtained from the
$F$-distribution through

$$\beta(\Delta) = P_\Delta\{W^* > C\} = \int_{\frac{1 + \Delta_0 n}{1 + \Delta n} C}^\infty F_{s-1,(n-1)s}(y)\, dy.$$
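
The sums of squares $S_A^2$ and $S^2$, expressed in terms of the $X$'s, and the statistic $W^*$ of (7.41) are straightforward to compute for balanced data. The following is a minimal sketch; the function name, the nested-list data layout, and the numbers are illustrative assumptions, and the critical value $C$ from the $F$-distribution is not computed.

```python
def random_effects_w_star(x, delta0=0.0):
    """Return (S_A^2, S^2, W*) for the balanced one-way random-effects model.

    x: list of s batches, each a list of n measurements (hypothetical layout).
    delta0: the bound Delta_0 in the hypothesis sigma_A^2 / sigma^2 <= Delta_0.
    """
    s = len(x)
    n = len(x[0])
    batch_means = [sum(row) / n for row in x]            # X_i.
    grand_mean = sum(batch_means) / s                    # X_..
    # Between-batch sum of squares: n * sum_i (X_i. - X_..)^2
    s_a2 = n * sum((m - grand_mean) ** 2 for m in batch_means)
    # Within-batch sum of squares: sum_i sum_j (X_ij - X_i.)^2
    s2 = sum((xij - m) ** 2 for row, m in zip(x, batch_means) for xij in row)
    w_star = (s_a2 / (s - 1)) / (s2 / ((n - 1) * s)) / (1 + delta0 * n)
    return s_a2, s2, w_star

# Made-up data: s = 3 batches with n = 2 measurements each.
x = [[1.0, 3.0], [4.0, 6.0], [8.0, 8.0]]
s_a2, s2, w0 = random_effects_w_star(x)                  # Delta_0 = 0
_, _, w1 = random_effects_w_star(x, delta0=1.0)          # Delta_0 = 1
```

For $\Delta_0 = 0$ the value coincides with the usual one-way analysis-of-variance $F$ ratio, in line with the remark that (7.41) then agrees with the model I test (7.19); a positive $\Delta_0$ only rescales the ratio by $1/(1 + \Delta_0 n)$.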


The family of tests (7.41) for varying $\Delta_0$ is equivalent to the confidence
statements

$$\underline{\Delta} = \frac{1}{n}\left[\frac{S_A^2/(s - 1)}{C S^2/((n - 1)s)} - 1\right] \le \Delta. \tag{7.42}$$

The corresponding upper confidence bounds for $\Delta$ are obtained from the tests of
the hypotheses $\Delta \ge \Delta_0$. These have the acceptance regions $W^* \ge C'$, where $W^*$
is given by (7.41) and $C'$ is determined by

$$\int_{C'}^\infty F_{s-1,(n-1)s} = 1 - \alpha.$$

The resulting confidence bounds are

$$\Delta \le \frac{1}{n}\left[\frac{S_A^2/(s - 1)}{C' S^2/((n - 1)s)} - 1\right] = \bar\Delta. \tag{7.43}$$

Both the confidence sets (7.42) and (7.43) are equivariant with respect to the
group of transformations generated by those considered for the testing problems,
and hence are uniformly most accurate equivariant.

When $\underline{\Delta}$ is negative, the confidence set $(\underline{\Delta}, \infty)$ contains all possible values of
the parameter $\Delta$. For small $\Delta$, this will happen with high probability ($1 - \alpha$ for
$\Delta = 0$), as must be the case, since $\underline{\Delta}$ is then required to be a safe lower bound for
a quantity which is equal to or near zero. Even more awkward is the possibility
that $\bar\Delta$ is negative, so that the confidence set $(-\infty, \bar\Delta)$ is empty. An interpretation
is suggested by the fact that this occurs if and only if the hypothesis $\Delta \ge \Delta_0$
is rejected for all positive values of $\Delta_0$. This may be taken as an indication that
the assumed model is not appropriate,¹⁰ although it must be realized that for
small $\Delta$ the probability of the event $\bar\Delta < 0$ is near $\alpha$ even when the assumptions
are satisfied, so that this outcome will occasionally be observed.

The tests of $\Delta \le \Delta_0$ and $\Delta \ge \Delta_0$ are not only UMP invariant but also UMP
unbiased, and UMP unbiased tests also exist for testing $\Delta = \Delta_0$ against the
two-sided alternatives $\Delta \ne \Delta_0$. This follows from the fact that the joint density
of the $Z$'s constitutes an exponential family. The confidence sets associated with
these three families of tests are then uniformly most accurate unbiased (Problem
7.21). That optimum unbiased procedures exist in the model II case but not in
the corresponding model I problem is explained by the different structure of the
two hypotheses. The model II hypothesis $\sigma_A^2 = 0$ imposes one constraint, since it
concerns the single parameter $\sigma_A^2$. On the other hand, the corresponding model I
hypothesis $\sum_{i=1}^s \alpha_i^2 = 0$ specifies the values of the $s$ parameters $\alpha_1, \ldots, \alpha_s$, and
since $s - 1$ of these are independent, imposes $s - 1$ constraints.

A UMP invariant test of $\Delta \le \Delta_0$ does not exist if the sample sizes $n_i$ are unequal.
An invariant test with a weaker optimum property for this case is obtained
by Spjøtvoll (1967).

Since $\Delta$ is a ratio of variances, it is not surprising that the test statistic $W^*$
is quite sensitive to the assumption of normality; such robustness issues are discussed
in Section 11.3.1. More robust alternatives are discussed, for example,
by Arvesen and Layard (1975). Westfall (1989) compares invariant variance ratio
tests in mixed models.

Optimality of standard $F$ tests in balanced ANOVA models with mixed effects
is derived in Mathew and Sinha (1988a) and optimal tests in some unbalanced
designs are derived in Mathew and Sinha (1988b).


7.8 Nested Classifications 

The theory of the preceding section does not carry over even to so simple a situation
as the general one-way classification with unequal numbers in the different


¹⁰ For a discussion of possibly more appropriate alternative models, see Smith and
Murray (1984).
classes (Problem 7.24). However, the unbiasedness approach does extend to the
important case of a nested (hierarchical) classification with equal numbers in each 
class. This extension is sufficiently well indicated by carrying it through for the 
case of two factors; it follows for the general case by induction with respect to 
the number of factors. 

Returning to the illustration of a batch process, suppose that a single batch of
raw material suffices for several batches of the finished product. Let the experimental
material consist of $ab$ batches, $b$ coming from each of $a$ batches of raw
material, and let a sample of size $n$ be taken from each. Then (7.39) becomes

$$X_{ijk} = \mu + A_i + B_{ij} + U_{ijk} \tag{7.44}$$

$$(i = 1, \ldots, a;\ j = 1, \ldots, b;\ k = 1, \ldots, n)$$

where $A_i$ denotes the effect of the $i$th batch of raw material, $B_{ij}$ that of the
$j$th batch of finished product obtained from this material, and $U_{ijk}$ the effect
of the $k$th unit taken from this batch. All these variables are assumed to be
independently normally distributed with zero means and with variances $\sigma_A^2$, $\sigma_B^2$,
and $\sigma^2$ respectively. The main part of the induction argument consists of proving
the existence of an orthogonal transformation to variables $Z_{ijk}$, the joint density
of which, except for a constant, is

$$\exp\left[-\frac{1}{2(\sigma^2 + n\sigma_B^2 + bn\sigma_A^2)}\left((z_{111} - \sqrt{abn}\,\mu)^2 + \sum_{i=2}^a z_{i11}^2\right) - \frac{1}{2(\sigma^2 + n\sigma_B^2)} \sum_{i=1}^a \sum_{j=2}^b z_{ij1}^2 - \frac{1}{2\sigma^2} \sum_{i=1}^a \sum_{j=1}^b \sum_{k=2}^n z_{ijk}^2\right]. \tag{7.45}$$


As a first step, there exists for each fixed $i, j$ an orthogonal transformation
from $(X_{ij1}, \ldots, X_{ijn})$ to $(Y_{ij1}, \ldots, Y_{ijn})$ such that

$$Y_{ij1} = \sqrt{n}\, X_{ij\cdot} = \sqrt{n}\,\mu + \sqrt{n}\,(A_i + B_{ij} + U_{ij\cdot}).$$

As in the case of a single classification, the variables $Y_{ijk}$ with $k > 1$ depend
only on the $U$'s, are independently normally distributed with zero mean and
variance $\sigma^2$, and are independent of the $U_{ij\cdot}$. On the other hand, the variables
$Y_{ij1}$ have exactly the structure of the $Y_{ij}$ in the one-way classification,

$$Y_{ij1} = \mu' + A_i' + U_{ij}',$$

where $\mu' = \sqrt{n}\,\mu$, $A_i' = \sqrt{n}\, A_i$, $U_{ij}' = \sqrt{n}\,(B_{ij} + U_{ij\cdot})$, and where the variances of
$A_i'$ and $U_{ij}'$ are $\sigma_A'^2 = n\sigma_A^2$ and $\sigma'^2 = \sigma^2 + n\sigma_B^2$ respectively. These variables can
therefore be transformed to variables $Z_{ij1}$ whose density is given by (7.40) with
$Z_{ij1}$ in place of $Z_{ij}$. Putting $Z_{ijk} = Y_{ijk}$ for $k > 1$, the joint density of all $Z_{ijk}$ is
then given by (7.45).

Two hypotheses of interest can be tested on the basis of (7.45): $H_1 : \sigma_A^2/(\sigma^2 +
n\sigma_B^2) \le \Delta_0$ and $H_2 : \sigma_B^2/\sigma^2 \le \Delta_0$. Both state that one or the other of the
classifications has little effect on the outcome. Let

$$S_A^2 = \sum_{i=2}^a Z_{i11}^2, \qquad S_B^2 = \sum_{i=1}^a \sum_{j=2}^b Z_{ij1}^2, \qquad S^2 = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=2}^n Z_{ijk}^2.$$

To obtain a test of $H_1$, one is tempted to eliminate $S^2$ through invariance under
multiplication of $Z_{ijk}$ for $k > 1$ by an arbitrary constant. However, these
transformations do not leave (7.45) invariant, since they do not always preserve
the fact that $\sigma^2$ is the smallest of the three variances $\sigma^2$, $\sigma^2 + n\sigma_B^2$, and
$\sigma^2 + n\sigma_B^2 + bn\sigma_A^2$. We shall instead consider the problem from the point of view
of unbiasedness. For any unbiased test of $H_1$, the probability of rejection is $\alpha$
whenever $\sigma_A^2/(\sigma^2 + n\sigma_B^2) = \Delta_0$, and hence in particular when the three variances
are $\sigma^2$, $\tau_0^2$, and $(1 + bn\Delta_0)\tau_0^2$ for any fixed $\tau_0^2$ and all $\sigma^2 < \tau_0^2$. It follows by
the techniques of Chapter 4 that the conditional probability of rejection given
$S^2 = s^2$ must be equal to $\alpha$ for almost all values of $s^2$. With $S^2$ fixed, the joint
distribution of the remaining variables is of the same type as (7.45) after the
elimination of $Z_{111}$, and a UMP unbiased conditional test given $S^2 = s^2$ has the
rejection region

$$W_1^* = \frac{1}{1 + bn\Delta_0} \cdot \frac{S_A^2/(a - 1)}{S_B^2/((b - 1)a)} > C_1. \tag{7.46}$$

Since $S_A^2$ and $S_B^2$ are independent of $S^2$, the constant $C_1$ is determined by the fact
that when $\sigma_A^2/(\sigma^2 + n\sigma_B^2) = \Delta_0$, the statistic $W_1^*$ is distributed as $F_{a-1,(b-1)a}$
and hence in particular does not depend on $s$. The test (7.46) is clearly unbiased
and hence UMP unbiased.

An alternative proof of this optimality property can be obtained using Theorem
6.6.1. The existence of a UMP unbiased test follows from the exponential family
structure of the density (7.45), and the test is the same whether $\tau^2$ is equal to
$\sigma^2 + n\sigma_B^2$ and hence $\ge \sigma^2$, or whether it is unrestricted. However, in the latter
case, the test (7.46) is UMP invariant and therefore is UMP unbiased even when
$\tau^2 \ge \sigma^2$.

The argument with respect to $H_2$ is completely analogous and shows the UMP
unbiased test to have the rejection region

$$W_2^* = \frac{1}{1 + n\Delta_0} \cdot \frac{S_B^2/((b - 1)a)}{S^2/((n - 1)ab)} > C_2, \tag{7.47}$$

where $C_2$ is determined by the fact that for $\sigma_B^2/\sigma^2 = \Delta_0$, the statistic $W_2^*$ is
distributed as $F_{(b-1)a,(n-1)ab}$.

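It remains to express the statistics $S_A^2$, $S_B^2$, and $S^2$ in terms of the $X$'s. From
the corresponding expressions in the one-way classification, it follows that

$$S_A^2 = \sum_{i=1}^a Z_{i11}^2 - Z_{111}^2 = b \sum_{i=1}^a (Y_{i\cdot 1} - Y_{\cdot\cdot 1})^2,$$

$$S_B^2 = \sum_{i=1}^a \left[\sum_{j=1}^b Z_{ij1}^2 - Z_{i11}^2\right] = \sum_{i=1}^a \sum_{j=1}^b (Y_{ij1} - Y_{i\cdot 1})^2,$$

and

$$S^2 = \sum_{i=1}^a \sum_{j=1}^b \left[\sum_{k=1}^n Y_{ijk}^2 - Y_{ij1}^2\right] = \sum_{i=1}^a \sum_{j=1}^b \left[\sum_{k=1}^n U_{ijk}^2 - n U_{ij\cdot}^2\right] = \sum\sum\sum (U_{ijk} - U_{ij\cdot})^2.$$

Hence

$$S_A^2 = bn \sum (X_{i\cdot\cdot} - X_{\cdot\cdot\cdot})^2, \qquad S_B^2 = n \sum\sum (X_{ij\cdot} - X_{i\cdot\cdot})^2, \tag{7.48}$$

$$S^2 = \sum\sum\sum (X_{ijk} - X_{ij\cdot})^2.$$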

It is seen from the expression of the statistics in terms of the $Z$'s that their
expectations are $E[S_A^2/(a - 1)] = \sigma^2 + n\sigma_B^2 + bn\sigma_A^2$, $E[S_B^2/((b - 1)a)] = \sigma^2 + n\sigma_B^2$,
and $E[S^2/((n - 1)ab)] = \sigma^2$. The decomposition

$$\sum\sum\sum (X_{ijk} - X_{\cdot\cdot\cdot})^2 = S_A^2 + S_B^2 + S^2$$

therefore forms a basis for the analysis of the variance of $X_{ijk}$,

$$\mathrm{Var}(X_{ijk}) = \sigma_A^2 + \sigma_B^2 + \sigma^2$$

by providing estimates of the components of variance $\sigma_A^2$, $\sigma_B^2$, and $\sigma^2$, and tests
of certain ratios of these components.
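
The decomposition and the expectations just stated suggest simple estimates of the variance components, obtained by equating each mean square to its expectation. The sketch below computes $S_A^2$, $S_B^2$, $S^2$ from (7.48) for balanced nested data and then solves those three equations. It is one natural way to use the expectations rather than a procedure prescribed by the text; the function names, the data layout `x[i][j][k]`, and the numbers are assumptions. Such moment estimates can come out negative.

```python
def nested_sums_of_squares(x):
    """S_A^2, S_B^2, S^2 of (7.48); x[i][j][k] is the k-th unit of the j-th
    finished batch made from the i-th raw-material batch (balanced layout)."""
    a, b, n = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[i][j]) / n for j in range(b)] for i in range(a)]    # X_ij.
    row = [sum(cell[i]) / b for i in range(a)]                          # X_i..
    grand = sum(row) / a                                                # X_...
    s_a2 = b * n * sum((row[i] - grand) ** 2 for i in range(a))
    s_b2 = n * sum((cell[i][j] - row[i]) ** 2
                   for i in range(a) for j in range(b))
    s2 = sum((x[i][j][k] - cell[i][j]) ** 2
             for i in range(a) for j in range(b) for k in range(n))
    return s_a2, s_b2, s2

def moment_estimates(s_a2, s_b2, s2, a, b, n):
    """Solve E[mean square] = expectation for (sigma_A^2, sigma_B^2, sigma^2)."""
    ms_a = s_a2 / (a - 1)               # estimates sigma^2 + n sigma_B^2 + b n sigma_A^2
    ms_b = s_b2 / ((b - 1) * a)         # estimates sigma^2 + n sigma_B^2
    ms_e = s2 / ((n - 1) * a * b)       # estimates sigma^2
    return (ms_a - ms_b) / (b * n), (ms_b - ms_e) / n, ms_e

# Made-up data with a = 2, b = 2, n = 2.
x = [[[1.0, 3.0], [5.0, 7.0]],
     [[9.0, 11.0], [13.0, 15.0]]]
s_a2, s_b2, s2 = nested_sums_of_squares(x)
var_a, var_b, var_e = moment_estimates(s_a2, s_b2, s2, 2, 2, 2)
total = sum((v - 8.0) ** 2 for i in x for j in i for v in j)   # 8.0 is the grand mean
```

For these numbers the three sums of squares are 128, 32, and 8, which add up to the total sum of squares 168, as the decomposition requires.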

Nested two-way classifications also occur as mixed models. Suppose for example
that a firm produces the material of the previous illustrations in different plants.
If $\alpha_i$ denotes the effect of the $i$th plant (which is fixed, since the plants do not
change in the replication of the experiment), $B_{ij}$ the batch effect, and $U_{ijk}$ the
unit effect, the observations have the structure

$$X_{ijk} = \mu + \alpha_i + B_{ij} + U_{ijk}. \tag{7.49}$$


Instead of reducing the $X$'s to the fully canonical form in terms of the $Z$'s
as before, it is convenient to carry out only the reduction to the $Y$'s (such that
$Y_{ij1} = \sqrt{n}\, X_{ij\cdot}$) and the first of the two transformations which take the $Y$'s into
the $Z$'s. If the resulting variables are denoted by $W_{ijk}$, they satisfy $W_{i11} = \sqrt{b}\, Y_{i\cdot 1}$,
$W_{ijk} = Y_{ijk}$ for $k > 1$ and

$$\sum_{i=1}^a (W_{i11} - W_{\cdot 11})^2 = S_A^2, \qquad \sum_{i=1}^a \sum_{j=2}^b W_{ij1}^2 = S_B^2, \qquad \sum_{i=1}^a \sum_{j=1}^b \sum_{k=2}^n W_{ijk}^2 = S^2,$$

where $S_A^2$, $S_B^2$, and $S^2$ are given by (7.48). The joint density of the $W$'s is, except
for a constant,

$$\exp\left[-\frac{1}{2(\sigma^2 + n\sigma_B^2)}\left(\sum_{i=1}^a (w_{i11} - \mu - \alpha_i)^2 + \sum_{i=1}^a \sum_{j=2}^b w_{ij1}^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^a \sum_{j=1}^b \sum_{k=2}^n w_{ijk}^2\right]. \tag{7.50}$$

This shows clearly the different nature of the problem of testing that the plant
effect is small,

$$H : \alpha_1 = \cdots = \alpha_a = 0 \quad \text{or} \quad H' : \frac{\sum \alpha_i^2}{\sigma^2 + n\sigma_B^2} \le \Delta_0,$$

and testing the corresponding hypothesis for the batch effect: $\sigma_B^2/\sigma^2 \le \Delta_0$. The
first of these is essentially a model I problem (linear hypothesis). As before,
unbiasedness implies that the conditional rejection probability given $S^2 = s^2$ is
equal to $\alpha$ a.e. With $S^2$ fixed, the problem of testing $H$ is a linear hypothesis,
and the UMP invariant conditional test given $S^2 = s^2$ has
the rejection region (7.46) with $\Delta_0 = 0$. The constant $C_1$ is again independent of
$S^2$, and the test is UMP among all tests that are both unbiased and invariant. A
test with the same property also exists for testing $H'$. Its rejection region is

$$\frac{S_A^2/(a - 1)}{S_B^2/((b - 1)a)} > C',$$

where $C'$ is determined from the noncentral $F$-distribution instead of, as before,
the (central) $F$-distribution.

On the other hand, the hypothesis $\sigma_B^2/\sigma^2 \le \Delta_0$ is essentially model II. It is
invariant under addition of an arbitrary constant to each of the variables $W_{i11}$,
which leaves $\sum_{i=1}^a \sum_{j=2}^b W_{ij1}^2$ and $\sum_{i=1}^a \sum_{j=1}^b \sum_{k=2}^n W_{ijk}^2$ as maximal invariants,
and hence reduces the structure to pure model II with one classification. The test
is then given by (7.47) as before. It is both UMP invariant and UMP unbiased.

Very general mixed models (containing general type II models as special cases) 
are discussed, for example, by Harville (1978), J. Miller (1977a), and Brown 
(1984), but see the note following Problem 7.36. 

The different one- and two-factor models are discussed from a Bayesian point of 
view, for example, in Box and Tiao (1973) and Broemeling (1985). In distinction 
to the approach presented here, the Bayesian treatment also includes inferences 
concerning the values of the individual random components such as the batch 
means of Section 7.7. 


7.9 Multivariate Extensions 

The univariate linear models studied so far in this chapter arise in the study of the 
effects of various experimental conditions (factors) on a single characteristic such 
as yield, weight, length of life, or blood pressure. This characteristic is assumed 
to be normally distributed with a mean that depends on the various factors under 
investigation, and a variance that is independent of these factors. We shall now 
consider the multivariate analogue of this model, which is appropriate when one 
is concerned with the effect of one or more factors simultaneously on several 
characteristics, for example the effect of a change in the diet of dairy cows on 
both fat content and quantity of milk. 

A random vector $(X_1, \ldots, X_p)$ has a multivariate normal distribution if its density
is of the form

$$\frac{|A|^{1/2}}{(2\pi)^{p/2}} \exp\left[-\frac{1}{2} \sum\sum a_{ij}(x_i - \xi_i)(x_j - \xi_j)\right], \tag{7.51}$$

where the matrix $A = (a_{ij})$ is positive definite, and $|A|$ denotes its determinant.
The means and covariance matrix of the $X$'s are given by

$$E(X_i) = \xi_i, \qquad E(X_i - \xi_i)(X_j - \xi_j) = \sigma_{ij}, \qquad (\sigma_{ij}) = A^{-1}. \tag{7.52}$$

Such a model was previously introduced in Section 3.9.2.

Consider now $n$ i.i.d. multivariate normal vectors $X_k = (X_{k,1}, \ldots, X_{k,p})$,
$k = 1, \ldots, n$, with means $E(X_{k,i}) = \xi_i$ and covariance matrix $A^{-1}$. A natural extension
of the one-sample problem of testing the mean $\xi$ of a normal distribution
with unknown variance is that of testing the hypothesis

$$\xi_1 = \xi_{1,0}, \ \ldots, \ \xi_p = \xi_{p,0};$$

without loss of generality, assume $\xi_{k,0} = 0$ for all $k$. The joint density of
$X_1, \ldots, X_n$ is

$$\frac{|A|^{n/2}}{(2\pi)^{np/2}} \exp\left[-\frac{1}{2} \sum_{k=1}^n \sum_{i=1}^p \sum_{j=1}^p a_{i,j}(x_{k,i} - \xi_i)(x_{k,j} - \xi_j)\right].$$

Writing the triple sum in the exponent as

$$\sum_{i=1}^p \sum_{j=1}^p a_{i,j} \sum_{k=1}^n (x_{k,i} - \xi_i)(x_{k,j} - \xi_j),$$

it is seen that the vector of sample means $(\bar X_1, \ldots, \bar X_p)$ together with

$$S_{i,j} = \sum_{k=1}^n (X_{k,i} - \bar X_i)(X_{k,j} - \bar X_j), \qquad i, j = 1, \ldots, p \tag{7.53}$$

are sufficient for the unknown mean vector $\xi$ and unknown covariance matrix $\Sigma =
A^{-1}$ (assumed positive definite). For the remainder of this section, assume $n > p$,
so that the matrix $S$ with $(i, j)$ component $S_{i,j}$ is nonsingular with probability
one (Problem 7.38).

We shall now consider the group of transformations

$$X_k' = C X_k \qquad (C \text{ nonsingular}).$$

This leaves the problem invariant, since it preserves the normality of the variables
and their means. It simply replaces the unknown covariance matrix by another
one. In the space of sufficient statistics, this group induces the transformations

$$\bar X^* = C \bar X \quad \text{and} \quad S^* = C S C^T, \quad \text{where } S = (S_{i,j}). \tag{7.54}$$

Under this group, the statistic

$$W = \bar X^T S^{-1} \bar X \tag{7.55}$$

is maximal invariant (Problem 7.39).
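
The invariance of $W$ under the transformations (7.54) can be checked numerically. The sketch below handles only the bivariate case $p = 2$, where $S^{-1}$ has an explicit form, and compares $W$ before and after applying a nonsingular matrix $C$ to every observation. The function names, the data, and the choice of $C$ are illustrative assumptions.

```python
def w_statistic(points):
    """W = xbar' S^{-1} xbar as in (7.55), for bivariate data (p = 2).

    S is the matrix of sums of squares and products about the sample means.
    """
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points)
    s22 = sum((p[1] - m2) ** 2 for p in points)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points)
    det = s11 * s22 - s12 ** 2
    # Quadratic form using the explicit inverse of the 2x2 matrix S.
    return (s22 * m1 ** 2 - 2.0 * s12 * m1 * m2 + s11 * m2 ** 2) / det

def transform(points, c):
    """Apply the 2x2 matrix c to each observation (X_k -> C X_k)."""
    return [(c[0][0] * u + c[0][1] * v, c[1][0] * u + c[1][1] * v)
            for u, v in points]

points = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 4.0)]   # made-up data, n = 4
c = [[2.0, 1.0], [0.0, 3.0]]                                 # nonsingular: det = 6
w_before = w_statistic(points)
w_after = w_statistic(transform(points, c))
```

Both values agree up to rounding, as invariance requires; the check says nothing about maximality, which is the content of Problem 7.39.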

The distribution of $W$ depends only on the maximal invariant in the parameter
space; this is found to be

$$\psi^2 = \sum_{i=1}^p \sum_{j=1}^p a_{ij}\, \xi_i \xi_j, \tag{7.56}$$

and the probability density of $W$ is given by (Problem 7.40)

$$p_\psi(w) = e^{-\psi^2/2} \sum_{k=0}^\infty c_k\, \frac{(\psi^2/2)^k}{k!}\, \frac{w^{p/2 - 1 + k}}{(1 + w)^{n/2 + k}}. \tag{7.57}$$


This is the same as the density of the test statistic in the univariate case, given as
(7.6), with $r$ and $s$ there replaced by $p$. For any $\psi_0 < \psi_1$ the ratio $p_{\psi_1}(w)/p_{\psi_0}(w)$
is an increasing function of $w$, and it follows from the Neyman-Pearson Lemma
that the most powerful invariant test for testing $H : \xi_1 = \cdots = \xi_p = 0$ rejects
when $W$ is too large, or equivalently when

$$\frac{n - p}{p}\, W > C. \tag{7.58}$$

The quantity $(n - 1)W$, which for $p = 1$ reduces to the square of Student's $t$,
is Hotelling's $T^2$-statistic. The constant $C$ is determined from the fact that for
$\psi = 0$ the statistic $(n - p)W/p$ has the $F$-distribution with $p$ and $n - p$ degrees
of freedom. As in the univariate case, there also exists a UMP invariant test of
the more general hypothesis $H' : \psi^2 \le \psi_0^2$, with rejection region $W > C'$.

The $T^2$-test was shown by Stein (1956) to be admissible against the class of
alternatives $\psi^2 \ge c$ for any $c > 0$ by the method of Theorem 6.7.1. Against the
class of alternatives $\psi^2 \le c$ admissibility was proved by Kiefer and Schwartz
(1965) [see Problem 7.44 and Schwartz (1967, 1969)].

Most accurate equivariant confidence sets for the unknown mean vector
$(\xi_1, \ldots, \xi_p)$ are obtained from the UMP invariant test of $H : \xi_i = \xi_{i0}$
$(i = 1, \ldots, p)$, which has acceptance region

$$n \sum\sum (\bar X_i - \xi_{i0})(n - 1) S^{i,j} (\bar X_j - \xi_{j0}) \le C,$$

where $S^{i,j}$ are the elements of $S^{-1}$. The associated confidence sets are therefore
ellipsoids

$$n \sum\sum (\xi_i - \bar X_i)(n - 1) S^{i,j} (\xi_j - \bar X_j) \le C \tag{7.59}$$

centered at $(\bar X_1, \ldots, \bar X_p)$. These confidence sets are equivariant under the group of
transformations considered in this section (Problem 7.41), and by Lemma 6.10.1
are therefore uniformly most accurate among all equivariant confidence sets at
the specified level.

The result extends to the two-sample problem with equal covariances (Problem
7.43), but the situation becomes more complicated for multivariate generalizations
of univariate linear hypotheses with r > 1. Then, the maximal invariant is
no longer univariate and a UMP invariant test no longer exists. For a discussion
of this case, see Anderson (2003), Section 8.10.


7.10 Problems 


Section 7.1 

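Problem 7.1 Expected sums of squares. The expected values of the numerator
and denominator of the statistic W* defined by (7.7) are

\[ E\left( \sum_{i=1}^{r} \frac{Y_i^2}{r} \right) = \sigma^2 + \frac{1}{r} \sum_{i=1}^{r} \eta_i^2 \quad\text{and}\quad E\left[ \sum_{i=s+1}^{n} \frac{Y_i^2}{n - s} \right] = \sigma^2. \]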


Problem 7.2 Noncentral $\chi^2$-distribution.¹¹

(i) If X is distributed as $N(\psi, 1)$, the probability density of $V = X^2$ is
$p_\psi^V(v) = \sum_{k=0}^{\infty} P_k(\psi) f_{2k+1}(v)$, where $P_k(\psi) = (\psi^2/2)^k e^{-(1/2)\psi^2}/k!$ and where $f_{2k+1}$
is the probability density of a $\chi^2$-variable with 2k + 1 degrees of freedom.

(ii) Let $Y_1,\ldots,Y_r$ be independently normally distributed with unit variance
and means $\eta_1,\ldots,\eta_r$. Then $U = \sum Y_i^2$ is distributed according to the
noncentral $\chi^2$-distribution with r degrees of freedom and noncentrality
parameter $\psi^2 = \sum_{i=1}^{r} \eta_i^2$, which has probability density

\[ p_\psi^U(u) = \sum_{k=0}^{\infty} P_k(\psi) f_{r+2k}(u). \tag{7.60} \]

Here $P_k(\psi)$ and $f_{r+2k}(u)$ have the same meaning as in (i), so that the
distribution is a mixture of $\chi^2$-distributions with Poisson weights.

¹¹ The literature on noncentral $\chi^2$, including tables, is reviewed in Tiku (1985a), Chou,
Arthur, Rosenstein, and Owen (1994), and Johnson, Kotz and Balakrishnan (1995).


[(i): This is seen from

\[ p_\psi^V(v) = \frac{e^{-\frac{1}{2}(\psi^2 + v)} \left( e^{\psi\sqrt{v}} + e^{-\psi\sqrt{v}} \right)}{2\sqrt{2\pi v}} \]

by expanding the expression in parentheses into a power series, and using the
fact that $\Gamma(2k) = 2^{2k-1}\Gamma(k)\Gamma(k + \frac{1}{2})/\sqrt{\pi}$.

(ii): Consider an orthogonal transformation to $Z_1,\ldots,Z_r$ such that $Z_1 =
\sum \eta_i Y_i/\psi$. Then the Z's are independent normal with unit variance and means
$E(Z_1) = \psi$ and $E(Z_i) = 0$ for i > 1.]
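
The Poisson-mixture representation in Problem 7.2 is easy to check numerically. The sketch below, with hypothetical function names and an arbitrary truncation of the infinite series at 60 terms (an assumption that is adequate only for moderate values of ψ²), evaluates the mixture density (7.60) and compares it for r = 1 with the closed form given in the hint to part (i).

```python
import math


def chi2_pdf(u, df):
    """Density of the central chi-squared distribution with df degrees of freedom."""
    if u <= 0:
        return 0.0
    h = df / 2
    return math.exp((h - 1) * math.log(u) - u / 2 - h * math.log(2) - math.lgamma(h))


def poisson_weight(k, psi2):
    """P_k(psi) = (psi^2/2)^k exp(-psi^2/2) / k!."""
    lam = psi2 / 2
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def noncentral_chi2_pdf(u, r, psi2, terms=60):
    """Truncated Poisson mixture of central chi-squared densities, as in (7.60)."""
    return sum(poisson_weight(k, psi2) * chi2_pdf(u, r + 2 * k) for k in range(terms))


def closed_form_df1(v, psi):
    """Density of V = X^2 for X ~ N(psi, 1), from the hint to part (i)."""
    s = math.sqrt(v)
    num = math.exp(-(psi * psi + v) / 2) * (math.exp(psi * s) + math.exp(-psi * s))
    return num / (2 * math.sqrt(2 * math.pi * v))


if __name__ == "__main__":
    psi = 1.5
    for v in (0.3, 1.0, 2.0, 5.0):
        print(v, noncentral_chi2_pdf(v, 1, psi * psi), closed_form_df1(v, psi))
```

The two columns printed for each v should agree to within floating-point error, which is the content of part (i).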


Problem 7.3 Noncentral F- and beta-distribution.¹² Let $Y_1,\ldots,Y_r$; $Y_{s+1},\ldots,Y_n$
be independently normally distributed with common variance $\sigma^2$ and means
$E(Y_i) = \eta_i$ $(i = 1,\ldots,r)$; $E(Y_i) = 0$ $(i = s + 1,\ldots,n)$.

(i) The probability density of $W = \sum_{i=1}^{r} Y_i^2 / \sum_{i=s+1}^{n} Y_i^2$ is given by (7.6). The
distribution of the constant multiple $(n - s)W/r$ of W is the noncentral
F-distribution.

(ii) The distribution of the statistic $B = \sum_{i=1}^{r} Y_i^2 / \left( \sum_{i=1}^{r} Y_i^2 + \sum_{i=s+1}^{n} Y_i^2 \right)$ is
the noncentral beta-distribution, which has probability density

\[ \sum_{k=0}^{\infty} P_k(\psi)\, g_{\frac{1}{2}r + k,\, \frac{1}{2}(n - s)}(b), \tag{7.61} \]

where

\[ g_{p,q}(b) = \frac{\Gamma(p + q)}{\Gamma(p)\Gamma(q)}\, b^{p-1} (1 - b)^{q-1} \tag{7.62} \]

is the probability density of the (central) beta-distribution.


Problem 7.4 (i) The noncentral $\chi^2$ and F distributions have strictly
monotone likelihood ratio.

(ii) Under the assumptions of Section 7.1, the hypothesis $H' : \psi^2 \le \psi_0^2$ ($\psi_0 > 0$
given) remains invariant under the transformations $G_i$ $(i = 1, 2, 3)$ that
were used to reduce $H : \psi = 0$, and there exists a UMP invariant test
with rejection region $W > C'$. The constant $C'$ is determined by $P_{\psi_0}\{W >
C'\} = \alpha$, with the density of W given by (7.6).

¹² For literature on noncentral F, see Tiku (1985b) and Johnson, Kotz and
Balakrishnan (1995).

[(i): Let $f(z) = \sum_{k=0}^{\infty} b_k z^k / \sum_{k=0}^{\infty} a_k z^k$ where the constants $a_k$, $b_k$ are > 0 and
$\sum a_k z^k$ and $\sum b_k z^k$ converge for all z > 0, and suppose that $b_k/a_k < b_{k+1}/a_{k+1}$
for all k. Then

\[ f'(z) = \frac{\sum\sum_{k<n} (n - k)(a_k b_n - a_n b_k) z^{k+n-1}}{\left( \sum_{k=0}^{\infty} a_k z^k \right)^2} \]

is positive, since $(n - k)(a_k b_n - a_n b_k) > 0$ for k < n, and hence f is increasing.]
Note. The noncentral $\chi^2$ and F-distributions are in fact $STP_\infty$ [see for example
Marshall and Olkin (1979) and Brown, Johnstone and MacGibbon (1981)], and
there thus exists a test of $H : \psi = \psi_0$ against $\psi \ne \psi_0$ which is UMP among all
tests that are both invariant and unbiased.


Problem 7.5 Best average power.

(i) Consider the general linear hypothesis H in the canonical form given by
(7.2) and (7.3) of Section 7.1, and for any $\eta_{r+1},\ldots,\eta_s$, $\sigma$, and $\rho$ let $S =
S(\eta_{r+1},\ldots,\eta_s, \sigma; \rho)$ denote the sphere $\{(\eta_1,\ldots,\eta_r) : \sum_{i=1}^{r} \eta_i^2/\sigma^2 = \rho^2\}$.
If $\beta_\phi(\eta_1,\ldots,\eta_r, \sigma)$ denotes the power of a test $\phi$ of H, then the test (7.9)
maximizes the average power

\[ \frac{\int_S \beta_\phi(\eta_1,\ldots,\eta_r, \sigma)\, dA}{\int_S dA} \]

for every $\eta_{r+1},\ldots,\eta_s$, $\sigma$, and $\rho$ among all unbiased (or similar) tests. Here
dA denotes the differential of area on the surface of the sphere.

(ii) The result (i) provides an alternative proof of the fact that the test (7.9) is
UMP among all tests whose power function depends only on $\sum_{i=1}^{r} \eta_i^2/\sigma^2$.

[(i): If $U = \sum_{i=1}^{r} Y_i^2$, $V = \sum_{i=s+1}^{n} Y_i^2$, unbiasedness (or similarity) implies that
the conditional probability of rejection given $Y_{r+1},\ldots,Y_s$, and U + V equals $\alpha$
a.e. Hence for any given $\eta_{r+1},\ldots,\eta_s$, $\sigma$, and $\rho$, the average power is maximized
by rejecting when the ratio of the average density to the density under H is larger
than a suitable constant $C(y_{r+1},\ldots,y_s, u + v)$, and hence when

\[ g(y_1,\ldots,y_r) = \int_S \exp\left( \sum_{i=1}^{r} \frac{\eta_i y_i}{\sigma^2} \right) dA > C(y_{r+1},\ldots,y_s, u + v). \]

As will be indicated below, the function g depends on $y_1,\ldots,y_r$ only through
u and is an increasing function of u. Since under the hypothesis U/(U + V)
is independent of $Y_{r+1},\ldots,Y_s$ and U + V, it follows that the test is given by
(7.9). The exponent in the integral defining g can be written as $\sum_{i=1}^{r} \eta_i y_i/\sigma^2 =
(\rho\sqrt{u}\cos\beta)/\sigma$, where $\beta$ is the angle $(0 \le \beta \le \pi)$ between $(\eta_1,\ldots,\eta_r)$ and
$(y_1,\ldots,y_r)$. Because of the symmetry of the sphere, this is unchanged if $\beta$ is
replaced by the angle $\gamma$ between $(\eta_1,\ldots,\eta_r)$ and an arbitrary fixed vector. This
shows that g depends on the y's only through u; for fixed $\eta_1,\ldots,\eta_r$, $\sigma$ denote it
by h(u). Let S' be the subset of S in which $0 \le \gamma \le \pi/2$. Then

\[ h(u) = \int_{S'} \left[ \exp\left( \frac{\rho\sqrt{u}\cos\gamma}{\sigma} \right) + \exp\left( -\frac{\rho\sqrt{u}\cos\gamma}{\sigma} \right) \right] dA, \]

which proves the desired result.]

Problem 7.6 Use Theorem 6.7.1 to show that the F-test (7.7) is $\alpha$-admissible
against $\Omega' : \psi \ge \psi_1$ for any $\psi_1 > 0$.

Problem 7.7 Given any $\psi_2 > 0$, apply Theorem 6.7.2 and Lemma 6.7.1 to
obtain the F-test (7.7) as a Bayes test against a set $\Omega'$ of alternatives contained
in the set $0 < \psi \le \psi_2$.

Section 7.2 

Problem 7.8 Under the assumptions of Section 7.1 suppose that the means $\xi_i$
are given by

\[ \xi_i = \sum_{j=1}^{s} a_{ij}\beta_j, \]

where the constants $a_{ij}$ are known and the matrix $A = (a_{ij})$ has full rank, and
where the $\beta_j$ are unknown parameters. Let $\theta = \sum_{j=1}^{s} e_j\beta_j$ be a given linear
combination of the $\beta_j$.

(i) If $\hat\beta_j$ denotes the values of the $\beta_j$ minimizing $\sum (X_i - \xi_i)^2$ and if $\hat\theta =
\sum_{j=1}^{s} e_j\hat\beta_j = \sum_{i=1}^{n} d_i X_i$, the rejection region of the hypothesis $H : \theta = \theta_0$
is

\[ \frac{|\hat\theta - \theta_0| / \sqrt{\sum d_i^2}}{\sqrt{\sum (X_i - \hat\xi_i)^2/(n - s)}} > C_0, \tag{7.63} \]

where the left-hand side under H has the distribution of the absolute value
of Student's t with n - s degrees of freedom.

(ii) The associated confidence intervals for $\theta$ are

\[ \hat\theta - k\sqrt{\frac{\sum (X_i - \hat\xi_i)^2}{n - s}} \le \theta \le \hat\theta + k\sqrt{\frac{\sum (X_i - \hat\xi_i)^2}{n - s}} \tag{7.64} \]

with $k = C_0\sqrt{\sum d_i^2}$. These intervals are uniformly most accurate
equivariant under a suitable group of transformations.

[(i): Consider first the hypothesis $\theta = 0$, and suppose without loss of generality
that $\theta = \beta_1$; the general case can be reduced to this by making a linear
transformation in the space of the $\beta$'s. If $a_1,\ldots,a_s$ denote the column vectors of the
matrix A which by assumption span $\Pi_\Omega$, then $\xi = \beta_1 a_1 + \cdots + \beta_s a_s$, and since $\hat\xi$ is
in $\Pi_\Omega$ also $\hat\xi = \hat\beta_1 a_1 + \cdots + \hat\beta_s a_s$. The space $\Pi_\omega$ defined by the hypothesis $\beta_1 = 0$
is spanned by the vectors $a_2,\ldots,a_s$ and also by the row vectors $c_2,\ldots,c_s$ of the
matrix C of (7.1), while $c_1$ is orthogonal to $\Pi_\omega$. By (7.1), the vector X is given
by $X = \sum_{i=1}^{n} Y_i c_i$, and its projection $\hat\xi$ on $\Pi_\Omega$ therefore satisfies $\hat\xi = \sum_{i=1}^{s} Y_i c_i$.
Equating the two expressions for $\hat\xi$ and taking the inner product of both sides of
this equation with $c_1$ gives $Y_1 = \hat\beta_1 \sum_{i=1}^{n} a_{i1} c_{1i}$, since the c's are an orthogonal set
of unit vectors. This shows that $Y_1$ is proportional to $\hat\beta_1$ and, since the variance of
$Y_1$ is the same as that of the X's, that $|Y_1| = |\hat\beta_1|/\sqrt{\sum d_i^2}$. The result for testing
$\beta_1 = 0$ now follows from (7.12) and (7.13). The test for $\beta_1 = \beta_1^0$ is obtained by
making the transformation $X_i^* = X_i - a_{i1}\beta_1^0$.

(ii): The invariance properties of the intervals (7.64) can again be discussed
without loss of generality by letting $\theta$ be the parameter $\beta_1$. In the canonical form of
Section 7.1, one then has $E(Y_1) = \eta_1 = \lambda\beta_1$ with $|\lambda| = 1/\sqrt{\sum d_i^2}$ while $\eta_2,\ldots,\eta_s$
do not involve $\beta_1$. The hypothesis $\beta_1 = \beta_1^0$ is therefore equivalent to $\eta_1 = \eta_1^0$, with
$\eta_1^0 = \lambda\beta_1^0$. This is invariant (a) under addition of arbitrary constants to $Y_2,\ldots,Y_s$;
(b) under the transformations $Y_1^* = -(Y_1 - \eta_1^0) + \eta_1^0$; (c) under the scale changes
$Y_i^* = cY_i$ $(i = 2,\ldots,n)$, $Y_1^* - \eta_1^{0*} = c(Y_1 - \eta_1^0)$. The confidence intervals for
$\theta = \beta_1$ are then uniformly most accurate equivariant under the group obtained
from (a), (b), and (c) by varying $\eta_1^0$.]


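Problem 7.9 Let $X_{ij}$ $(j = 1,\ldots,m_i)$ and $Y_{ik}$ $(k = 1,\ldots,n_i)$ be independently
normally distributed with common variance $\sigma^2$ and means $E(X_{ij}) = \xi_i$ and
$E(Y_{ik}) = \xi_i + \Delta$. Then the UMP invariant test of $H : \Delta = 0$ is given by (7.63)
with $\theta = \Delta$, $\theta_0 = 0$ and

\[ \hat\theta = \frac{\sum_i \frac{m_i n_i}{N_i} (Y_{i\cdot} - X_{i\cdot})}{\sum_i \frac{m_i n_i}{N_i}}, \qquad \hat\xi_i = \frac{\sum_{j=1}^{m_i} X_{ij} + \sum_{k=1}^{n_i} (Y_{ik} - \hat\theta)}{N_i}, \]

where $N_i = m_i + n_i$.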

Problem 7.10 Let $X_1,\ldots,X_n$ be independently normally distributed with
known variance $\sigma_0^2$ and means $E(X_i) = \xi_i$, and consider any linear hypothesis
with $s \le n$ (instead of s < n which is required when the variance is unknown).
This remains invariant under a subgroup of that employed when the variance was
unknown, and the UMP invariant test has rejection region

\[ \sum (X_i - \hat{\hat\xi}_i)^2 - \sum (X_i - \hat\xi_i)^2 = \sum (\hat\xi_i - \hat{\hat\xi}_i)^2 > C\sigma_0^2 \tag{7.65} \]

with C determined by

\[ \int_C^{\infty} \chi_r^2(y)\, dy = \alpha. \tag{7.66} \]


Section 7.3 

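Problem 7.11 If the variables $X_{ij}$ $(j = 1,\ldots,n_i;\ i = 1,\ldots,s)$ are independently
distributed as $N(\mu_i, \sigma^2)$, then

\[ E\left[ \sum n_i (X_{i\cdot} - X_{\cdot\cdot})^2 \right] = (s - 1)\sigma^2 + \sum n_i (\mu_i - \mu_\cdot)^2, \]

\[ E\left[ \sum\sum (X_{ij} - X_{i\cdot})^2 \right] = (n - s)\sigma^2. \]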


Problem 7.12 Let $Z_1,\ldots,Z_s$ be independently distributed as $N(\zeta_i, a_i^2)$, $i =
1,\ldots,s$, where the $a_i$ are known constants.

(i) With respect to a suitable group of linear transformations there exists a
UMP invariant test of $H : \zeta_1 = \cdots = \zeta_s$ given by the rejection region

\[ \sum \frac{1}{a_i^2} \left( Z_i - \frac{\sum Z_j/a_j^2}{\sum 1/a_j^2} \right)^2 = \sum \left( \frac{Z_i}{a_i} \right)^2 - \frac{\left( \sum Z_j/a_j^2 \right)^2}{\sum 1/a_j^2} > C. \tag{7.67} \]

(ii) The power of this test is the integral from C to $\infty$ of the noncentral
$\chi^2$-density with s - 1 degrees of freedom and noncentrality parameter $\lambda^2$
obtained by substituting $\zeta_i$ for $Z_i$ in the left-hand side of (7.67).


Section 7.5 

Problem 7.13 The linear-hypothesis test of the hypothesis of no interaction in 
a two-way layout with m observations per cell is given by (7.28). 

Problem 7.14 In the two-way layout of Section 7.5 with a = b = 2, denote the
first three terms in the partition of $\sum\sum\sum (X_{ijk} - X_{ij\cdot})^2$ by $S_A^2$, $S_B^2$, and $S_{AB}^2$,
corresponding to the A, B, and AB effects (i.e. the $\alpha$'s, $\beta$'s, and $\gamma$'s), and denote
by $H_A$, $H_B$, and $H_{AB}$ the hypotheses of these effects being zero. Define a new
two-level factor B' which is at level 1 when A and B are both at level 1 or both
at level 2, and which is at level 2 when A and B are at different levels. Then

\[ H_{B'} = H_{AB}, \quad S_{B'}^2 = S_{AB}^2, \quad H_{AB'} = H_B, \quad S_{AB'}^2 = S_B^2, \]

so that the B-effect has become an interaction, and the AB-interaction the effect
of the factor B'. [Shaffer (1977b).]

Problem 7.15 Let $X_\lambda$ denote a random variable distributed as noncentral
$\chi^2$ with f degrees of freedom and noncentrality parameter $\lambda^2$. Then $X_{\lambda'}$ is
stochastically larger than $X_\lambda$ if $\lambda < \lambda'$.

[It is enough to show that if Y is distributed as N(0, 1), then $(Y + \lambda')^2$ is
stochastically larger than $(Y + \lambda)^2$. The equivalent fact that for any z > 0,

\[ P\{|Y + \lambda'| \le z\} \le P\{|Y + \lambda| \le z\}, \]

is an immediate consequence of the shape of the normal density function. An
alternative proof is obtained by combining Problem 7.4 with Lemma 3.4.2.]
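
The inequality in the hint to Problem 7.15 can be illustrated numerically. The sketch below (the function name is hypothetical) evaluates P{|Y + shift| ≤ z} for a standard normal Y from the normal distribution function; it is a spot check over a few values of z and nonnegative shifts, not a proof.

```python
from statistics import NormalDist


def prob_abs_within(z, shift):
    """P{|Y + shift| <= z} for Y ~ N(0, 1)."""
    nd = NormalDist()
    return nd.cdf(z - shift) - nd.cdf(-z - shift)


if __name__ == "__main__":
    # For 0 <= shift < shift_prime the probability should not increase.
    for z in (0.1, 0.5, 1.0, 2.0, 3.0):
        for shift, shift_prime in ((0.0, 0.5), (0.5, 1.0), (1.0, 2.5)):
            print(z, shift, shift_prime,
                  prob_abs_within(z, shift), prob_abs_within(z, shift_prime))
```

In each printed row the last value should be no larger than the one before it, reflecting the fact that the standard normal density places less mass on an interval of fixed length as the interval moves away from the origin.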

Problem 7.16 Let $X_{ijk}$ $(i = 1,\ldots,a;\ j = 1,\ldots,b;\ k = 1,\ldots,m)$ be
independently normally distributed with common variance $\sigma^2$ and mean

\[ E(X_{ijk}) = \mu + \alpha_i + \beta_j + \gamma_k \qquad \left( \sum \alpha_i = \sum \beta_j = \sum \gamma_k = 0 \right). \]

Determine the linear hypothesis test for testing $H : \alpha_1 = \cdots = \alpha_a = 0$.

Problem 7.17 In the three-factor situation of the preceding problem, suppose
that a = b = m. The hypothesis H can then be tested on the basis of $m^2$
observations as follows. At each pair of levels (i, j) of the first two factors one
observation is taken, to which we refer as being in the ith row and the jth
column. If the levels of the third factor are chosen in such a way that each
of them occurs once and only once in each row and column, the experimental
design is a Latin square. The $m^2$ observations are denoted by $X_{ij(k)}$, where the
third subscript indicates the level of the third factor when the first two are at
levels i and j. It is assumed that $E(X_{ij(k)}) = \xi_{ij(k)} = \mu + \alpha_i + \beta_j + \gamma_k$, with
$\sum \alpha_i = \sum \beta_j = \sum \gamma_k = 0$.

(i) The parameters are determined from the $\xi$'s through the equations

\[ \xi_{i\cdot(\cdot)} = \mu + \alpha_i, \quad \xi_{\cdot j(\cdot)} = \mu + \beta_j, \quad \xi_{\cdot\cdot(k)} = \mu + \gamma_k, \quad \xi_{\cdot\cdot(\cdot)} = \mu. \]

(Summation over j with i held fixed automatically causes summation also
over k.)

(ii) The least-squares estimates of the parameters may be obtained from the
identity

\[ \sum_i \sum_j \left[ x_{ij(k)} - \xi_{ij(k)} \right]^2 = m \sum \left[ x_{i\cdot(\cdot)} - x_{\cdot\cdot(\cdot)} - \alpha_i \right]^2 + m \sum \left[ x_{\cdot j(\cdot)} - x_{\cdot\cdot(\cdot)} - \beta_j \right]^2 \]
\[ \qquad + m \sum \left[ x_{\cdot\cdot(k)} - x_{\cdot\cdot(\cdot)} - \gamma_k \right]^2 + m^2 \left[ x_{\cdot\cdot(\cdot)} - \mu \right]^2 \]
\[ \qquad + \sum_i \sum_j \left[ x_{ij(k)} - x_{i\cdot(\cdot)} - x_{\cdot j(\cdot)} - x_{\cdot\cdot(k)} + 2x_{\cdot\cdot(\cdot)} \right]^2. \]

(iii) For testing the hypothesis $H : \alpha_1 = \cdots = \alpha_m = 0$, the test statistic W* of
(7.15) is

\[ \frac{m \sum \left[ X_{i\cdot(\cdot)} - X_{\cdot\cdot(\cdot)} \right]^2}{\sum\sum \left[ X_{ij(k)} - X_{i\cdot(\cdot)} - X_{\cdot j(\cdot)} - X_{\cdot\cdot(k)} + 2X_{\cdot\cdot(\cdot)} \right]^2 / (m - 2)}. \]

The degrees of freedom are m - 1 for the numerator and (m - 1)(m - 2)
for the denominator, and the noncentrality parameter is $\psi^2 = m \sum \alpha_i^2/\sigma^2$.
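
The defining property of the Latin square in Problem 7.17 (each level of the third factor occurs once and only once in each row and column) can be made concrete with a short sketch. The cyclic construction used here is only one convenient choice and is an assumption of this example, not something the problem prescribes; the function names are hypothetical. The last function checks that the degrees of freedom stated in part (iii), together with those of the other two factors, add up to m² - 1.

```python
def cyclic_latin_square(m):
    """Level of the third factor in row i, column j, using the cyclic rule (i + j) mod m."""
    return [[(i + j) % m for j in range(m)] for i in range(m)]


def is_latin_square(sq):
    """True if every level 0..m-1 occurs exactly once in each row and each column."""
    m = len(sq)
    levels = set(range(m))
    if any(len(row) != m for row in sq):
        return False
    rows_ok = all(set(row) == levels for row in sq)
    cols_ok = all({sq[i][j] for i in range(m)} == levels for j in range(m))
    return rows_ok and cols_ok


def degrees_of_freedom(m):
    """Split of the m^2 - 1 degrees of freedom: three main effects and the residual."""
    main_effect = m - 1
    residual = (m - 1) * (m - 2)
    return main_effect, residual, 3 * main_effect + residual


if __name__ == "__main__":
    for row in cyclic_latin_square(4):
        print(row)
    print(is_latin_square(cyclic_latin_square(4)))
    print(degrees_of_freedom(4))
```

With m = 4 the residual has (m - 1)(m - 2) = 6 degrees of freedom, and the total 3(m - 1) + 6 = 15 equals m² - 1.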


Section 7.6 

Problem 7.18 In a regression situation, suppose that the observed values $X_j$
and $Y_j$ of the independent and dependent variable differ from certain true values
$X_j'$ and $Y_j'$ by errors $U_j$, $V_j$ which are independently normally distributed with
zero means and variances $\sigma_U^2$ and $\sigma_V^2$. The true values are assumed to satisfy a
linear relation: $Y_j' = \alpha + \beta X_j'$. However, the variables which are being controlled,
and which are therefore constants, are the $X_j$ rather than the $X_j'$. Writing $x_j$ for
$X_j$, we have $x_j = X_j' + U_j$, $Y_j = Y_j' + V_j$, and hence $Y_j = \alpha + \beta x_j + W_j$, where
$W_j = V_j - \beta U_j$. The results of Section 7.6 can now be applied to test that $\beta$ or
$\alpha + \beta x_0$ has a specified value.

Problem 7.19 Let $X_1,\ldots,X_m$; $Y_1,\ldots,Y_n$ be independently normally
distributed with common variance $\sigma^2$ and means $E(X_i) = \alpha + \beta(u_i - \bar u)$, $E(Y_j) =
\gamma + \delta(v_j - \bar v)$, where the u's and v's are known numbers. Determine the UMP
invariant tests of the linear hypotheses $H : \beta = \delta$ and $H : \alpha = \gamma,\ \beta = \delta$.

Problem 7.20 Let $X_1,\ldots,X_n$ be independently normally distributed with
common variance $\sigma^2$ and means $\xi_i = \alpha + \beta t_i + \gamma t_i^2$, where the $t_i$ are known. If the
coefficient vectors $(t_1^k,\ldots,t_n^k)$, k = 0, 1, 2, are linearly independent, the
parameter space $\Pi_\Omega$ has dimension s = 3, and the least-squares estimates $\hat\alpha, \hat\beta, \hat\gamma$ are the
unique solutions of the system of equations

\[ \alpha \sum t_i^k + \beta \sum t_i^{k+1} + \gamma \sum t_i^{k+2} = \sum t_i^k X_i \qquad (k = 0, 1, 2). \]

The solutions are linear functions of the X's, and if $\hat\gamma = \sum c_i X_i$, the hypothesis
$\gamma = 0$ is rejected when

\[ \frac{|\hat\gamma| / \sqrt{\sum c_i^2}}{\sqrt{\sum (X_i - \hat\alpha - \hat\beta t_i - \hat\gamma t_i^2)^2/(n - 3)}} > C_0. \]

Section 7.7 

Problem 7.21 (i) The test (7.41) of $H : \Delta \le \Delta_0$ is UMP unbiased.

(ii) Determine the UMP unbiased test of $H : \Delta = \Delta_0$ and the associated
uniformly most accurate unbiased confidence sets for $\Delta$.

Problem 7.22 In the model (7.39), the correlation coefficient $\rho$ between two
observations $X_{ij}$, $X_{ik}$ belonging to the same class, the so-called intraclass
correlation coefficient, is given by $\rho = \sigma_A^2/(\sigma_A^2 + \sigma^2)$.


Section 7.8 

Problem 7.23 The tests (7.46) and (7.47) are UMP unbiased. 

Problem 7.24 If $X_{ij}$ is given by (7.39) but the number $n_i$ of observations per
batch is not constant, obtain a canonical form corresponding to (7.40) by letting
$Y_{i1} = \sqrt{n_i}\, X_{i\cdot}$. Note that the set of sufficient statistics has more components than
when $n_i$ is constant.


Problem 7.25 The general nested classification with a constant number of
observations per cell, under model II, has the structure

\[ X_{ijk\cdots} = \mu + A_i + B_{ij} + C_{ijk} + \cdots + U_{ijk\cdots}, \]
\[ i = 1,\ldots,a; \quad j = 1,\ldots,b; \quad k = 1,\ldots,c; \ \ldots. \]

(i) This can be reduced to a canonical form generalizing (7.45).

(ii) There exist UMP unbiased tests of the hypotheses

\[ H_A : \frac{\sigma_A^2}{cd\cdots\sigma_B^2 + d\cdots\sigma_C^2 + \cdots + \sigma^2} \le \Delta_0, \]

\[ H_B : \frac{\sigma_B^2}{d\cdots\sigma_C^2 + \cdots + \sigma^2} \le \Delta_0. \]


Problem 7.26 Consider the model II analogue of the two-way layout of Section
7.5, according to which

\[ X_{ijk} = \mu + A_i + B_j + C_{ij} + E_{ijk} \qquad (i = 1,\ldots,a;\ j = 1,\ldots,b;\ k = 1,\ldots,n), \tag{7.68} \]

where the $A_i$, $B_j$, $C_{ij}$, and $E_{ijk}$ are independently normally distributed with
mean zero and with variances $\sigma_A^2$, $\sigma_B^2$, $\sigma_C^2$, and $\sigma^2$ respectively. Determine tests
which are UMP among all tests that are invariant (under a suitable group)
and unbiased of the hypotheses that the following ratios do not exceed a given
constant (which may be zero):

(i) $\sigma_C^2/\sigma^2$;

(ii) $\sigma_A^2/(n\sigma_C^2 + \sigma^2)$;

(iii) $\sigma_B^2/(n\sigma_C^2 + \sigma^2)$.

Note that the test of (i) requires n > 1, but those of (ii) and (iii) do not.

[Let $S_A^2 = nb\sum (X_{i\cdot\cdot} - X_{\cdots})^2$, $S_B^2 = na\sum (X_{\cdot j\cdot} - X_{\cdots})^2$, $S_C^2 = n\sum\sum (X_{ij\cdot} -
X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdots})^2$, $S^2 = \sum\sum\sum (X_{ijk} - X_{ij\cdot})^2$, and make a transformation
to new variables $Z_{ijk}$ (independent, normal, and with mean zero except when
i = j = k = 1) such that

\[ S_A^2 = \sum_{i=2}^{a} Z_{i11}^2, \qquad S_B^2 = \sum_{j=2}^{b} Z_{1j1}^2, \qquad S_C^2 = \sum_{i=2}^{a} \sum_{j=2}^{b} Z_{ij1}^2, \]

\[ S^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=2}^{n} Z_{ijk}^2.] \]

Problem 7.27 Consider the mixed model obtained from (7.68) by replacing the
random variables $A_i$ by unknown constants $\alpha_i$ satisfying $\sum \alpha_i = 0$. With (ii)
replaced by (ii') $\sum \alpha_i^2/(n\sigma_C^2 + \sigma^2)$, there again exist tests which are UMP among
all tests that are invariant and unbiased, and in cases (i) and (iii) these coincide
with the corresponding tests of Problem 7.26.

Problem 7.28 Consider the following generalization of the univariate linear
model of Section 7.1. The variables $X_i$ $(i = 1,\ldots,n)$ are given by $X_i = \xi_i + U_i$,
where $(U_1,\ldots,U_n)$ have a joint density which is spherical, that is, a function of
$\sum_{i=1}^{n} u_i^2$, say

\[ f(u_1,\ldots,u_n) = q\left( \sum u_i^2 \right). \]

The parameter spaces $\Pi_\Omega$ and $\Pi_\omega$ and the hypothesis H are as in Section 7.1.

(i) The orthogonal transformation (7.1) reduces $(X_1,\ldots,X_n)$ to canonical
variables $(Y_1,\ldots,Y_n)$ with $Y_i = \eta_i + V_i$, where $\eta_i = 0$ for i = s + 1,...,n,
H reduces to (7.3), and the V's have joint density $q(\sum v_i^2)$.

(ii) In the canonical form of (i), the problem is invariant under the groups $G_1$,
$G_2$, and $G_3$ of Section 7.1, and the statistic W* given by (7.7) is maximal
invariant.

Problem 7.29 Under the assumptions of the preceding problem, the null
distribution of W* is independent of q and hence the same as in the normal case,
namely, F with r and n - s degrees of freedom. [See Problem 5.11]. Note. The
analogous multivariate problem is treated by Kariya (1981); also see Kariya (1985)
and Kariya and Sinha (1985). For a review of work on spherically and elliptically
symmetric distributions, see Chmielewski (1981).

Problem 7.30 Consider the additive random-effects model

\[ X_{ijk} = \mu + A_i + B_j + U_{ijk} \qquad (i = 1,\ldots,a;\ j = 1,\ldots,b;\ k = 1,\ldots,n), \]

where the A's, B's, and U's are independent normal with zero means and
variances $\sigma_A^2$, $\sigma_B^2$, and $\sigma^2$ respectively. Determine

(i) the joint density of the X's,

(ii) the UMP unbiased test of $H : \sigma_B^2/\sigma^2 \le \delta$.

Problem 7.31 For the mixed model

\[ X_{ij} = \mu + \alpha_i + B_j + U_{ij} \qquad (i = 1,\ldots,a;\ j = 1,\ldots,n), \]

where the B's and U's are as in Problem 7.30 and the $\alpha$'s are constants adding to
zero, determine (with respect to a suitable group leaving the problem invariant)

(i) a UMP invariant test of $H : \alpha_1 = \cdots = \alpha_a$;

(ii) a UMP invariant test of $H : \xi_1 = \cdots = \xi_a = 0$ ($\xi_i = \mu + \alpha_i$);

(iii) a test of $H : \sigma_B^2/\sigma^2 \le \delta$ which is both UMP invariant and UMP unbiased.

Problem 7.32 Let $(X_{1j},\ldots,X_{pj})$, $j = 1,\ldots,n$, be a sample from a p-variate
normal distribution with mean $(\xi_1,\ldots,\xi_p)$ and covariance matrix $\Sigma = (\sigma_{ij})$,
where $\sigma_{ij} = \sigma^2$ when j = i, and $\sigma_{ij} = \rho\sigma^2$ when $j \ne i$. Show that the covariance
matrix is positive definite if and only if $\rho > -1/(p - 1)$.

[For fixed $\sigma$ and $\rho < 0$, the quadratic form $(1/\sigma^2) \sum\sum \sigma_{ij} y_i y_j = \sum y_i^2 +
\rho \sum\sum_{i \ne j} y_i y_j$ takes on its minimum value over $\sum y_i^2 = 1$ when all the y's are
equal.]

Problem 7.33 Under the assumptions of the preceding problem, determine the
UMP invariant test (with respect to a suitable G) of $H : \xi_1 = \cdots = \xi_p$.

[Show that this model agrees with that of Problem 7.31 if $\rho = \sigma_B^2/(\sigma_B^2 + \sigma^2)$, except
that instead of being positive, $\rho$ now only needs to satisfy $\rho > -1/(p - 1)$.]

Problem 7.34 Permitting interactions in the model of Problem 7.30 leads to
the model

\[ X_{ijk} = \mu + A_i + B_j + C_{ij} + U_{ijk} \qquad (i = 1,\ldots,a;\ j = 1,\ldots,b;\ k = 1,\ldots,n), \]

where the A's, B's, C's, and U's are independent normal with mean zero and
variances $\sigma_A^2$, $\sigma_B^2$, $\sigma_C^2$, and $\sigma^2$.

(i) Give an example of a situation in which such a model might be appropriate.

(ii) Reduce the model to a convenient canonical form along the lines of Section
7.4.

(iii) Determine UMP unbiased tests of (a) $H_1 : \sigma_B^2 = 0$; (b) $H_2 : \sigma_C^2 = 0$.

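Problem 7.35 Formal analogy with the model of Problem 7.34 suggests the
mixed model

\[ X_{ijk} = \mu + \alpha_i + B_j + C_{ij} + U_{ijk} \]

with the B's, C's, and U's as in Problem 7.34. Reduce this model to a canonical
form involving $X_{\cdots}$ and the sums of squares

\[ \frac{\sum (X_{i\cdot\cdot} - X_{\cdots} - \alpha_i)^2}{n\sigma_C^2 + \sigma^2}, \qquad \frac{\sum\sum (X_{\cdot j\cdot} - X_{\cdots})^2}{an\sigma_B^2 + n\sigma_C^2 + \sigma^2}, \]

\[ \frac{\sum\sum (X_{ij\cdot} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdots})^2}{n\sigma_C^2 + \sigma^2}, \qquad \frac{\sum\sum\sum (X_{ijk} - X_{ij\cdot})^2}{\sigma^2}. \]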

Problem 7.36 Among all tests that are both unbiased and invariant under
suitable groups under the assumptions of Problem 7.35, there exist UMP tests of

(i) $H_1 : \alpha_1 = \cdots = \alpha_a = 0$;

(ii) $H_2 : \sigma_B^2/(n\sigma_C^2 + \sigma^2) \le C$;

(iii) $H_3 : \sigma_C^2/\sigma^2 \le C$.

Note. The independence assumptions of Problems 7.35 and 7.36 often are not
realistic. For alternative models, derived from more basic assumptions, see Scheffé
(1956, 1959). Relations between the two types of models are discussed in Hocking
(1973), Cohen and Miller (1976), and Stuart and Ord (1991).

Problem 7.37 Let $(X_{1j1},\ldots,X_{1jn}, X_{2j1},\ldots,X_{2jn}, \ldots, X_{aj1},\ldots,X_{ajn})$, $j =
1,\ldots,b$, be a sample from an an-variate normal distribution. Let $E(X_{ijk}) =
\xi_i$, and denote by $\Sigma_{ii'}$ the matrix of covariances of $(X_{ij1},\ldots,X_{ijn})$ with
$(X_{i'j1},\ldots,X_{i'jn})$. Suppose that for all i, the diagonal elements of $\Sigma_{ii}$ are $= \tau^2$
and the off-diagonal elements are $= \rho_1\tau^2$, and that for $i \ne i'$ all $n^2$ elements of
$\Sigma_{ii'}$ are $= \rho_2\tau^2$.

(i) Find necessary and sufficient conditions on $\rho_1$ and $\rho_2$ for the overall abn ×
abn covariance matrix to be positive definite.

(ii) Show that this model agrees with that of Problem 7.35 for suitable values
of $\rho_1$ and $\rho_2$.


Section 7.9 

Problem 7.38 If $n \le p$, the matrix S with (i, j) component $S_{ij}$ defined in
(7.53) is singular. If n > p, it is nonsingular with probability 1. If $n \le p$, the
test $\phi \equiv \alpha$ is the only test that is invariant under the group of nonsingular linear
transformations.

Problem 7.39 Show that the statistic W given in (7.55) is maximal invariant.
[Hint: If $(\bar X, S)$ and $(\bar Y, T)$ are such that

\[ \bar X^T S^{-1} \bar X = \bar Y^T T^{-1} \bar Y, \]

then a transformation C that transforms one to the other is given by $C =
\bar Y (\bar X^T S^{-1} \bar X)^{-1} \bar X^T S^{-1}$.]

Problem 7.40 Verify that the density of W is given by (7.57).

Problem 7.41 The confidence ellipsoids (7.59) for $(\xi_1,\ldots,\xi_p)$ are equivariant
under the group of Section 7.9.

Problem 7.42 For testing that a multivariate mean vector $\xi$ is zero in the case where
$\Sigma$ is known, derive a UMPI test.

Problem 7.43 Extend the one-sample problem to the two-sample problem for 
testing whether two multivariate normal distributions with common unknown 
covariance matrix have the same mean vectors. 

Problem 7.44 Bayes character and admissibility of Hotelling's $T^2$.

(i) Let $(X_{\alpha 1},\ldots,X_{\alpha p})$, $\alpha = 1,\ldots,n$, be a sample from a p-variate normal
distribution with unknown mean $\xi = (\xi_1,\ldots,\xi_p)$ and covariance matrix
$\Sigma = A^{-1}$, and with $p \le n - 1$. Then the one-sample $T^2$-test of $H : \xi = 0$
against $K : \xi \ne 0$ is a Bayes test with respect to prior distributions $\Lambda_0$ and
$\Lambda_1$ which generalize those of Example 6.7.13 (continued).

(ii) The test of part (i) is admissible for testing H against the alternatives
$\psi^2 \le c$ for any c > 0.

[If $\omega$ is the subset of points $(0, \Sigma)$ of $\Omega_H$ satisfying $\Sigma^{-1} = A + \eta'\eta$ for some fixed
positive definite p × p matrix A and arbitrary $\eta = (\eta_1,\ldots,\eta_p)$, and $\Omega'_{A,b}$ is the
subset of points $(\xi, \Sigma)$ of $\Omega_K$ satisfying $\Sigma^{-1} = A + \eta'\eta$, $\xi' = b\Sigma\eta'$ for the same
A and some fixed b > 0, let $\Lambda_0$ and $\Lambda_1$ have densities defined over $\omega$ and $\Omega'_{A,b}$,
respectively by

\[ \lambda_0(\eta) = C_0 |A + \eta'\eta|^{-n/2} \]

and

\[ \lambda_1(\eta) = C_1 |A + \eta'\eta|^{-n/2} \exp\left\{ \frac{nb^2}{2} \left[ \eta (A + \eta'\eta)^{-1} \eta' \right] \right\}. \]

(Kiefer and Schwartz, 1965).]

Problem 7.45 Suppose $(X_1,\ldots,X_p)$ have the multivariate normal density
(7.51), so that $E(X_i) = \xi_i$ and $A^{-1}$ is the known positive definite covariance
matrix. The vector of means $\xi = (\xi_1,\ldots,\xi_p)$ is known to lie in a given s-dimensional
linear space $\Pi_\Omega$ with $s \le p$; the hypothesis to be tested is that $\xi$ lies in a given
(s - r)-dimensional linear subspace $\Pi_\omega$ of $\Pi_\Omega$ $(r \le s)$.

(i) Determine the UMPI test under a suitable group of transformations as
explicitly as possible. Find an expression for the power function.

(ii) Specialize to the case of a simple null hypothesis.


7.11 Notes 

The general linear model in the parametric form (7.18) was formulated at the 
beginning of the 19th century by Legendre and Gauss, who were concerned with 
estimating the unknown parameters. [For an account of its history, see Seal 
(1967).] The canonical form (7.2) of the model is due to Kolodziejczyk (1935). 
The analysis of variance, including the concept of interaction, was developed by 
Fisher in the 1920s and 1930s, and a systematic account is provided by Scheffé
(1959) in a book that includes a careful treatment of alternative models and of
robustness questions.

Different approaches to analysis of variance than that given here are considered 
in Speed (1987) and the discussion following this paper, and in Diaconis (1988, 
Section 8C). Rank tests are discussed in Marden and Muyot (1995). Admissibility 
results for testing homogeneity of variances in a normal balanced one-way layout 
are given in Cohen and Marden (1989). Linear models have been generalized 
in many directions. Loglinear models provide extensions to important discrete 
data. [Both are reviewed in Christensen (2000).] These two classes of models are 
subsumed in generalized linear models discussed for example in McCullagh and 
Nelder (1983), Dobson (1990) and Agresti (2002), and they in turn are a subset 
of additive linear models which are discussed in Hastie and Tibshirani (1990, 
1997). Modern treatments of regression analysis can be found, for example, in 
Weisberg (1985), Atkinson and Riani (2000) and Ruppert, Wand and Carroll 
(2003). UMPI tests can be constructed for tests of lack of fit in some regression 
models; see Christensen (1989) and Miller, Neill and Sherfey (1998). 

Hsu (1941) shows that the test (7.7) is UMP among all tests whose power 
function depends only on the noncentrality parameter. Hsu (1945) obtains a 
result on best average power for the T 2 -test analogous to that of Chapter 7, 
Problem 7.5. 

Tests of multivariate linear hypotheses and the associated confidence sets have 
their origin in the work of Hotelling (1931). More details on these procedures 
and discussion of other multivariate techniques can be found in the comprehensive 
books by Anderson (2003) and Seber (1984). A more geometric approach stressing 
invariance is provided by Eaton (1983). 

For some recent work on using rank tests in multivariate problems, see Choi
and Marden (1997), Hettmansperger, Möttönen and Oja (1997), and Akritas,
Arnold and Brunner (1997).



8 

The Minimax Principle 


8.1 Tests with Guaranteed Power 


The criteria discussed so far, unbiasedness and invariance, suffer from the dis¬ 
advantage of being applicable, or leading to optimum solutions, only in rather 
restricted classes of problems. We shall therefore turn now to an alternative 
approach, which potentially is of much wider applicability. Unfortunately, its 
application to specific problems is in general not easy, unless there exists a UMP 
invariant test. 

One of the important considerations in planning an experiment is the number 
of observations required to insure that the resulting statistical procedure will 
have the desired precision or sensitivity. For problems of hypothesis testing this 
means that the probabilities of the two kinds of errors should not exceed certain
preassigned bounds, say $\alpha$ and $1 - \beta$, so that the tests must satisfy the conditions

\[ E_\theta\varphi(X) \le \alpha \ \text{ for } \theta \in \Omega_H, \qquad E_\theta\varphi(X) \ge \beta \ \text{ for } \theta \in \Omega_K. \tag{8.1} \]

If the power function $E_\theta\varphi(X)$ is continuous and if $\alpha < \beta$, (8.1) cannot hold when
the sets $\Omega_H$ and $\Omega_K$ are contiguous. This mathematical difficulty corresponds in
part to the fact that the division of the parameter values $\theta$ into the classes $\Omega_H$
and $\Omega_K$ for which the two different decisions are appropriate is frequently not
sharp. Between the values for which one or the other of the decisions is clearly
correct there may lie others for which the relative advantages and disadvantages
of acceptance and rejection are approximately in balance. Accordingly we shall
assume that $\Omega$ is partitioned into three sets

\[ \Omega = \Omega_H + \Omega_I + \Omega_K, \]

of which $\Omega_I$ designates the indifference zone, and $\Omega_K$ the class of parameter values
differing so widely from those postulated by the hypothesis that false acceptance
of H is a serious error, which should occur with probability at most $1 - \beta$.

To see how the sample size is determined in this situation, suppose that
$X_1, X_2, \ldots$ constitute the sequence of available random variables, and for a
moment let n be fixed and let $X = (X_1,\ldots,X_n)$. In the usual applications (for a
more precise statement, see Problem 8.1), there exists a test $\varphi_n$ which maximizes

\[ \inf_{\Omega_K} E_\theta\varphi(X) \tag{8.2} \]

among all level-$\alpha$ tests based on X. Let $\beta_n = \inf_{\Omega_K} E_\theta\varphi_n(X)$, and suppose that
for sufficiently large n there exists a test satisfying (8.1). [Conditions under which
this is the case are given by Berger (1951a) and Kraft (1955).] The desired sample
size, which is the smallest value of n for which $\beta_n \ge \beta$, is then obtained by trial
and error. This requires the ability to determine for each fixed n the test that
maximizes (8.2) subject to

\[ E_\theta\varphi(X) \le \alpha \ \text{ for } \theta \in \Omega_H. \tag{8.3} \]
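
The trial-and-error search for the sample size can be sketched in a simple special case. The example below is an illustration under assumptions that are not part of the text: the observations are taken to be normal with known standard deviation σ, the hypothesis is θ ≤ 0, the alternatives of concern are θ ≥ θ₁ (so that 0 < θ < θ₁ is the indifference zone), and the test rejects for large values of the standardized sample mean. In that setting the minimum power over the alternatives is attained at θ₁. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def min_power(n, theta1, sigma, alpha):
    """beta_n: power at theta1 of the level-alpha one-sided test based on n observations."""
    nd = NormalDist()
    critical = nd.inv_cdf(1 - alpha)
    return 1 - nd.cdf(critical - math.sqrt(n) * theta1 / sigma)


def smallest_sample_size(theta1, sigma, alpha, beta, n_max=10**6):
    """Smallest n with beta_n >= beta, found by stepping through n = 1, 2, ..."""
    for n in range(1, n_max + 1):
        if min_power(n, theta1, sigma, alpha) >= beta:
            return n
    raise ValueError("no sample size up to n_max attains the required power")


if __name__ == "__main__":
    n = smallest_sample_size(theta1=0.5, sigma=1.0, alpha=0.05, beta=0.90)
    print(n, min_power(n - 1, 0.5, 1.0, 0.05), min_power(n, 0.5, 1.0, 0.05))
```

Because β_n is increasing in n here, a linear search suffices; in less regular problems each step would require solving the maximin problem for that n, which is what the remainder of the section addresses.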

A method for determining a test with this maximin property (of maximizing
the minimum power over $\Omega_K$) is obtained by generalizing Theorem 3.8.1. It will be
convenient in this discussion to make a change of notation, and to denote by $\omega$ and
$\omega'$ the subsets of $\Omega$ previously denoted by $\Omega_H$ and $\Omega_K$. Let $\mathcal{P} = \{P_\theta, \theta \in \omega \cup \omega'\}$
be a family of probability distributions over a sample space $(\mathcal{X}, \mathcal{A})$ with densities
$p_\theta = dP_\theta/d\mu$ with respect to a $\sigma$-finite measure $\mu$, and suppose that the densities
$p_\theta(x)$ considered as functions of the two variables $(x, \theta)$ are measurable $(\mathcal{A} \times \mathcal{B})$
and $(\mathcal{A} \times \mathcal{B}')$, where $\mathcal{B}$ and $\mathcal{B}'$ are given $\sigma$-fields over $\omega$ and $\omega'$. Under these
assumptions, the following theorem gives conditions under which a solution of a
suitable Bayes problem provides a test with the required properties.


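Theorem 8.1.1 For any distributions $\Lambda$ and $\Lambda'$ over $\mathcal{B}$ and $\mathcal{B}'$, let $\varphi_{\Lambda,\Lambda'}$ be the
most powerful test for testing

$$h(x) = \int_{\omega} p_\theta(x)\,d\Lambda(\theta)$$

at level $\alpha$ against

$$h'(x) = \int_{\omega'} p_\theta(x)\,d\Lambda'(\theta)$$

and let $\beta_{\Lambda,\Lambda'}$ be its power against the alternative $h'$. If there exist $\Lambda$ and $\Lambda'$ such
that

$$\sup_{\omega} E_\theta\varphi_{\Lambda,\Lambda'}(X) \le \alpha, \qquad \inf_{\omega'} E_\theta\varphi_{\Lambda,\Lambda'}(X) = \beta_{\Lambda,\Lambda'}, \tag{8.4}$$

then:

(i) $\varphi_{\Lambda,\Lambda'}$ maximizes $\inf_{\omega'} E_\theta\varphi(X)$ among all level-$\alpha$ tests of the hypothesis
$H : \theta \in \omega$ and is the unique test with this property if it is the unique most
powerful level-$\alpha$ test for testing $h$ against $h'$.

(ii) The pair of distributions $\Lambda, \Lambda'$ is least favorable in the sense that for
any other pair $\nu, \nu'$ we have

$$\beta_{\Lambda,\Lambda'} \le \beta_{\nu,\nu'}.$$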

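Proof. (i): If $\varphi^*$ is any other level-$\alpha$ test of $H$, it is also of level $\alpha$ for testing
the simple hypothesis that the density of $X$ is $h$, and the power of $\varphi^*$ against $h'$
therefore cannot exceed $\beta_{\Lambda,\Lambda'}$. It follows that

$$\inf_{\omega'} E_\theta\varphi^*(X) \le \int_{\omega'} E_\theta\varphi^*(X)\,d\Lambda'(\theta) \le \beta_{\Lambda,\Lambda'} = \inf_{\omega'} E_\theta\varphi_{\Lambda,\Lambda'}(X),$$

and the second inequality is strict if $\varphi_{\Lambda,\Lambda'}$ is unique.

(ii): Let $\nu, \nu'$ be any other distributions over $(\omega, \mathcal{B})$ and $(\omega', \mathcal{B}')$, and let

$$g(x) = \int_{\omega} p_\theta(x)\,d\nu(\theta), \qquad g'(x) = \int_{\omega'} p_\theta(x)\,d\nu'(\theta).$$

Since both $\varphi_{\Lambda,\Lambda'}$ and $\varphi_{\nu,\nu'}$ are level-$\alpha$ tests of the hypothesis that $g(x)$ is the
density of $X$, it follows that

$$\beta_{\nu,\nu'} \ge \int \varphi_{\Lambda,\Lambda'}(x)\,g'(x)\,d\mu(x) \ge \inf_{\omega'} E_\theta\varphi_{\Lambda,\Lambda'}(X) = \beta_{\Lambda,\Lambda'}. \quad ■$$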


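Corollary 8.1.1 Let $\Lambda, \Lambda'$ be two probability distributions and $C$ a constant such
that

$$\varphi_{\Lambda,\Lambda'}(x) = \begin{cases} 1 & \text{if } \int_{\omega'} p_\theta(x)\,d\Lambda'(\theta) > C\int_{\omega} p_\theta(x)\,d\Lambda(\theta), \\ \gamma & \text{if } \int_{\omega'} p_\theta(x)\,d\Lambda'(\theta) = C\int_{\omega} p_\theta(x)\,d\Lambda(\theta), \\ 0 & \text{if } \int_{\omega'} p_\theta(x)\,d\Lambda'(\theta) < C\int_{\omega} p_\theta(x)\,d\Lambda(\theta) \end{cases} \tag{8.5}$$

is a size-$\alpha$ test for testing that the density of $X$ is $\int_{\omega} p_\theta(x)\,d\Lambda(\theta)$ and such that

$$\Lambda(\omega_0) = \Lambda'(\omega_0') = 1, \tag{8.6}$$

where

$$\omega_0 = \Big\{\theta : \theta \in \omega \text{ and } E_\theta\varphi_{\Lambda,\Lambda'}(X) = \sup_{\theta' \in \omega} E_{\theta'}\varphi_{\Lambda,\Lambda'}(X)\Big\},$$

$$\omega_0' = \Big\{\theta : \theta \in \omega' \text{ and } E_\theta\varphi_{\Lambda,\Lambda'}(X) = \inf_{\theta' \in \omega'} E_{\theta'}\varphi_{\Lambda,\Lambda'}(X)\Big\}.$$

Then the conclusions of Theorem 8.1.1 hold.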

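Proof. If $h$, $h'$, and $\beta_{\Lambda,\Lambda'}$ are defined as in Theorem 8.1.1, the assumptions
imply that $\varphi_{\Lambda,\Lambda'}$ is a most powerful level-$\alpha$ test for testing $h$ against $h'$, that

$$\sup_{\omega} E_\theta\varphi_{\Lambda,\Lambda'}(X) = \int_{\omega} E_\theta\varphi_{\Lambda,\Lambda'}(X)\,d\Lambda(\theta) = \alpha,$$

and that

$$\inf_{\omega'} E_\theta\varphi_{\Lambda,\Lambda'}(X) = \int_{\omega'} E_\theta\varphi_{\Lambda,\Lambda'}(X)\,d\Lambda'(\theta) = \beta_{\Lambda,\Lambda'}.$$

The condition (8.4) is thus satisfied and Theorem 8.1.1 applies. ■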
Suppose that the sets $\Omega_H$, $\Omega_I$, and $\Omega_K$ are defined in terms of a nonnegative
function $d$, which is a measure of the distance of $\theta$ from $H$, by

$$\Omega_H = \{\theta : d(\theta) = 0\}, \quad \Omega_I = \{\theta : 0 < d(\theta) < \Delta\}, \quad \Omega_K = \{\theta : d(\theta) \ge \Delta\}.$$

Suppose also that the power function of any test is continuous in $\theta$. In the limit
as $\Delta = 0$, there is no indifference zone. Then $\Omega_K$ becomes the set $\{\theta : d(\theta) > 0\}$,
and the infimum of $\beta(\theta)$ over $\Omega_K$ is $\le \alpha$ for any level-$\alpha$ test. This infimum is
therefore maximized by any test satisfying $\beta(\theta) \ge \alpha$ for all $\theta \in \Omega_K$, that is,
by any unbiased test, so that unbiasedness is seen to be a limiting form of the
maximin criterion. A more useful limiting form, since it will typically lead to a
unique test, is given by the following definition. A test $\varphi_0$ is said to maximize the
minimum power locally¹ if, given any other test $\varphi$, there exists $\Delta_0$ such that

$$\inf_{\omega_\Delta} \beta_{\varphi_0}(\theta) \ge \inf_{\omega_\Delta} \beta_{\varphi}(\theta) \quad \text{for all } 0 < \Delta < \Delta_0, \tag{8.7}$$

where $\omega_\Delta$ is the set of $\theta$'s for which $d(\theta) \ge \Delta$.


8.2 Examples 

In Chapter 3 it was shown for a family of probability densities depending on a real
parameter $\theta$ that a UMP test exists for testing $H : \theta \le \theta_0$ against $\theta > \theta_0$ provided
for all $\theta < \theta'$ the ratio $p_{\theta'}(x)/p_\theta(x)$ is a monotone function of some real-valued
statistic. This assumption, although satisfied for a one-parameter exponential
family, is quite restrictive, and a UMP test of $H$ will in fact exist only rarely. A
more general approach is furnished by the formulation of the preceding section. If
the indifference zone is the set of $\theta$'s with $\theta_0 < \theta < \theta_1$, the problem becomes that
of maximizing the minimum power over the class of alternatives $\omega' : \theta \ge \theta_1$. Under
appropriate assumptions, one would expect the least favorable distributions $\Lambda$ and
$\Lambda'$ of Theorem 8.1.1 to assign probability 1 to the points $\theta_0$ and $\theta_1$, and hence
the maximin test to be given by the rejection region $p_{\theta_1}(x)/p_{\theta_0}(x) > C$. The
following lemma gives sufficient conditions for this to be the case.


Lemma 8.2.1 Let $X_1, \ldots, X_n$ be identically and independently distributed with
probability density $f_\theta(x)$, where $\theta$ and $x$ are real-valued, and suppose that for any
$\theta < \theta'$ the ratio $f_{\theta'}(x)/f_\theta(x)$ is a nondecreasing function of $x$. Then the level-$\alpha$
test $\varphi$ of $H$ which maximizes the minimum power over $\omega'$ is given by

$$\varphi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } r(x_1, \ldots, x_n) > C, \\ \gamma & \text{if } r(x_1, \ldots, x_n) = C, \\ 0 & \text{if } r(x_1, \ldots, x_n) < C, \end{cases} \tag{8.8}$$

where $r(x_1, \ldots, x_n) = f_{\theta_1}(x_1) \cdots f_{\theta_1}(x_n) / f_{\theta_0}(x_1) \cdots f_{\theta_0}(x_n)$ and where $C$ and $\gamma$
are determined by

$$E_{\theta_0}\varphi(X_1, \ldots, X_n) = \alpha. \tag{8.9}$$


1 A different definition of local minimaxity is given by Giri and Kiefer (1964). 
Proof. The function $\varphi(x_1, \ldots, x_n)$ is nondecreasing in each of its arguments, so
that by Lemma 3.4.2,

$$E_\theta\varphi(X_1, \ldots, X_n) \le E_{\theta'}\varphi(X_1, \ldots, X_n)$$

when $\theta < \theta'$. Hence the power function of $\varphi$ is monotone and $\varphi$ is a level-$\alpha$ test.
Since $\varphi = \varphi_{\Lambda,\Lambda'}$, where $\Lambda$ and $\Lambda'$ are the distributions assigning probability 1
to the points $\theta_0$ and $\theta_1$, the condition (8.4) is satisfied, which proves the desired
result as well as the fact that the pair of distributions $(\Lambda, \Lambda')$ is least favorable.


Example 8.2.1 Let $\theta$ be a location parameter, so that $f_\theta(x) = g(x - \theta)$, and
suppose for simplicity that $g(x) > 0$ for all $x$. We will show that a necessary
and sufficient condition for $f_\theta(x)$ to have monotone likelihood ratio in $x$ is that
$-\log g$ is convex. The condition of monotone likelihood ratio in $x$,

$$\frac{g(x - \theta')}{g(x - \theta)} \le \frac{g(x' - \theta')}{g(x' - \theta)} \quad \text{for all } x < x',\ \theta < \theta',$$

is equivalent to

$$\log g(x' - \theta) + \log g(x - \theta') \le \log g(x - \theta) + \log g(x' - \theta').$$

Since $x - \theta = t(x - \theta') + (1 - t)(x' - \theta)$ and $x' - \theta' = (1 - t)(x - \theta') + t(x' - \theta)$, where
$t = (x' - x)/(x' - x + \theta' - \theta)$, a sufficient condition for this to hold is that the
function $-\log g$ is convex. To see that this condition is also necessary, let $a < b$
be any real numbers, and let $x - \theta' = a$, $x' - \theta = b$, and $x' - \theta' = x - \theta$. Then
$x - \theta = \frac{1}{2}(x' - \theta + x - \theta') = \frac{1}{2}(a + b)$, and the condition of monotone likelihood
ratio implies

$$\tfrac{1}{2}[\log g(a) + \log g(b)] \le \log g\big[\tfrac{1}{2}(a + b)\big].$$

Since $\log g$ is measurable, this in turn implies that $-\log g$ is convex.²

A density $g$ for which $-\log g$ is convex is called strongly unimodal. Basic
properties of such densities were obtained by Ibragimov (1956). Strong unimodality
is a special case of total positivity. A density of the form $g(x - \theta)$ which is totally
positive of order $r$ is said to be a Polya frequency function of order $r$. It follows
from Example 8.2.1 that $g(x - \theta)$ is a Polya frequency function of order 2 if and
only if it is strongly unimodal. [For further results concerning Polya frequency
functions and strongly unimodal densities, see Karlin (1968), Marshall and Olkin
(1979), Huang and Ghosh (1982), and Loh (1984a, b).]

Two distributions which satisfy the above condition [besides the normal
distribution, for which the resulting densities $p_\theta(x_1, \ldots, x_n)$ form an exponential
family] are the double exponential distribution with

$$g(x) = \tfrac{1}{2}e^{-|x|}$$

and the logistic distribution, whose cumulative distribution function is

$$G(x) = \frac{1}{1 + e^{-x}},$$

so that the density is $g(x) = e^{-x}/(1 + e^{-x})^2$. ■
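
The midpoint inequality used in Example 8.2.1 lends itself to a simple numerical check. The sketch below is only a sanity check on a finite grid, not a proof; the grid, the tolerance, and all function names are our own choices. It evaluates $-\log g$ for the double exponential and logistic densities and, for contrast, for the Cauchy density, which is not strongly unimodal.

```python
import math


def neg_log_double_exponential(x):
    # -log[(1/2) exp(-|x|)]
    return abs(x) + math.log(2.0)


def neg_log_logistic(x):
    # -log[e^{-x} / (1 + e^{-x})^2] = x + 2 log(1 + e^{-x})
    return x + 2.0 * math.log1p(math.exp(-x))


def neg_log_cauchy(x):
    # -log[1 / (pi (1 + x^2))]
    return math.log(math.pi) + math.log1p(x * x)


def is_midpoint_convex(f, grid, tol=1e-12):
    """Check f((a+b)/2) <= (f(a)+f(b))/2 for every pair of grid points."""
    return all(
        f(0.5 * (a + b)) <= 0.5 * (f(a) + f(b)) + tol
        for a in grid
        for b in grid
    )


grid = [i / 4.0 for i in range(-40, 41)]  # points in [-10, 10]
for name, f in [
    ("double exponential", neg_log_double_exponential),
    ("logistic", neg_log_logistic),
    ("Cauchy", neg_log_cauchy),
]:
    print(name, is_midpoint_convex(f, grid))
```

A grid check can only refute convexity, as it does for the Cauchy density (take $a = 1$, $b = 3$); passing it is merely consistent with the analytic argument.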


2 See Sierpinski (1920). 
Example 8.2.2 To consider the corresponding problem for a scale parameter,
let $f_\theta(x) = \theta^{-1}h(x/\theta)$ where $h$ is an even function. Without loss of generality one
may then restrict $x$ to be nonnegative, since the absolute values $|X_1|, \ldots, |X_n|$
form a set of sufficient statistics for $\theta$. If $Y_i = \log X_i$ and $\eta = \log\theta$, the density
of $Y_i$ is

$$h(e^{y - \eta})\,e^{y - \eta}.$$

By Example 8.2.1, if $h(x) > 0$ for all $x \ge 0$, a necessary and sufficient condition
for $f_{\theta'}(x)/f_\theta(x)$ to be a nondecreasing function of $x$ for all $\theta < \theta'$ is that
$-\log[e^y h(e^y)]$ or equivalently $-\log h(e^y)$ is a convex function of $y$. An example
in which this holds (in addition to the normal and double-exponential distributions,
where the resulting densities form an exponential family) is the Cauchy
distribution with

$$h(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}.$$

Since the convexity of $-\log h(y)$ implies that of $-\log h(e^y)$, it follows that if
$h$ is an even function and $h(x - \theta)$ has monotone likelihood ratio, so does $h(x/\theta)$.
When $h$ is the normal or double-exponential distribution, this property of $h(x/\theta)$
also follows from Example 8.2.1. That monotone likelihood ratio for the
scale-parameter family does not conversely imply the same property for the associated
location parameter family is illustrated by the Cauchy distribution. The condition
is therefore more restrictive for a location than for a scale parameter. ■
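
The contrast between the Cauchy scale and location families can be seen directly from the likelihood ratios. The following sketch is a numerical illustration under our own choice of parameter values and grid, with hypothetical function names. It tabulates $f_{\theta'}(x)/f_\theta(x)$ for the scale family on $x \ge 0$ and for the location family on the real line, and tests each sequence for monotonicity.

```python
import math


def cauchy(x):
    """Standard Cauchy density h(x) = 1 / (pi (1 + x^2))."""
    return 1.0 / (math.pi * (1.0 + x * x))


def scale_ratio(x, theta, theta_prime):
    # f_theta(x) = h(x / theta) / theta
    return (cauchy(x / theta_prime) / theta_prime) / (cauchy(x / theta) / theta)


def location_ratio(x, theta, theta_prime):
    # f_theta(x) = h(x - theta)
    return cauchy(x - theta_prime) / cauchy(x - theta)


def is_nondecreasing(values, tol=1e-12):
    return all(later >= earlier - tol for earlier, later in zip(values, values[1:]))


xs_nonneg = [i / 10.0 for i in range(0, 201)]   # 0, 0.1, ..., 20
xs_real = [i / 10.0 for i in range(-200, 201)]  # -20, ..., 20

scale_ok = is_nondecreasing([scale_ratio(x, 1.0, 2.0) for x in xs_nonneg])
location_ok = is_nondecreasing([location_ratio(x, 0.0, 1.0) for x in xs_real])
print(scale_ok, location_ok)
```

For $\theta = 1$, $\theta' = 2$ the scale ratio equals $2(1 + x^2)/(4 + x^2)$, which increases in $x \ge 0$, whereas the location ratio $(1 + x^2)/(1 + (x - 1)^2)$ tends to 1 at both ends of the line and so cannot be monotone.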


The chief difficulty in the application of Theorem 8.1.1 to specific problems
is the necessity of knowing, or at least being able to guess correctly, a pair of
least favorable distributions $(\Lambda, \Lambda')$. Guidance for obtaining these distributions
is sometimes provided by invariance considerations. If there exists a group $G$
of transformations of $X$ such that the induced group $\bar G$ leaves both $\omega$ and $\omega'$
invariant, the problem is symmetric in the various $\theta$'s that can be transformed
into each other under $\bar G$. It then seems plausible that unless $\Lambda$ and $\Lambda'$ exhibit the
same symmetries, they will make the statistician's task easier, and hence will not
be least favorable.


Example 8.2.3 In the problem of paired comparisons considered in
Example 6.3.5, the observations $X_i$ ($i = 1, \ldots, n$) are independent variables taking
on the values 1 and 0 with probabilities $p_i$ and $q_i = 1 - p_i$. The hypothesis $H$
to be tested specifies the set $\omega : \max p_i \le \frac{1}{2}$. Only alternatives with $p_i \ge \frac{1}{2}$
for all $i$ are considered, and as $\omega'$ we take the subset of those alternatives for
which $\max p_i \ge \frac{1}{2} + \delta$. One would expect $\Lambda$ to assign probability 1 to the point
$p_1 = \cdots = p_n = \frac{1}{2}$, and $\Lambda'$ to assign positive probability only to the $n$ points
$(p_1, \ldots, p_n)$ which have $n - 1$ coordinates equal to $\frac{1}{2}$ and the remaining
coordinate equal to $\frac{1}{2} + \delta$. Because of the symmetry with regard to the $n$ variables, it
seems plausible that $\Lambda'$ should assign equal probability $1/n$ to each of these $n$
points. With these choices, the test $\varphi_{\Lambda,\Lambda'}$ rejects when

$$\sum_{i=1}^{n} \left(\frac{\frac{1}{2} + \delta}{\frac{1}{2} - \delta}\right)^{x_i} > C.$$

This is equivalent to $\sum_{i=1}^{n} x_i > C$, which had previously been seen to be UMP
invariant for this problem. Since the critical function $\varphi_{\Lambda,\Lambda'}(x_1, \ldots, x_n)$ is
nondecreasing in each of its arguments, it follows from Lemma 3.4.2 that $p_i \le p_i'$ for
$i = 1, \ldots, n$ implies

$$E_{p_1, \ldots, p_n}\varphi_{\Lambda,\Lambda'}(X_1, \ldots, X_n) \le E_{p_1', \ldots, p_n'}\varphi_{\Lambda,\Lambda'}(X_1, \ldots, X_n)$$

and hence the conditions of Theorem 8.1.1 are satisfied. ■


Example 8.2.4 Let $X = (X_1, \ldots, X_n)$ be a sample from $N(\xi, \sigma^2)$, and consider
the problem of testing $H : \sigma = \sigma_0$ against the set of alternatives $\omega' : \sigma \le \sigma_1$ or $\sigma \ge \sigma_2$
($\sigma_1 < \sigma_0 < \sigma_2$). This problem remains invariant under the transformations
$X_i' = X_i + c$, which in the parameter space induce the group $\bar G$ of transformations
$\xi' = \xi + c$, $\sigma' = \sigma$. One would therefore expect the least favorable distribution $\Lambda$
over the line $\omega : -\infty < \xi < \infty$, $\sigma = \sigma_0$, to be invariant under $\bar G$. Such invariance
implies that $\Lambda$ assigns to any interval a measure proportional to the length of
the interval. Hence $\Lambda$ cannot be a probability measure and Theorem 8.1.1 is
not directly applicable. The difficulty can be avoided by approximating $\Lambda$ by
a sequence of probability distributions, in the present case for example by the
sequence of normal distributions $N(0, k)$, $k = 1, 2, \ldots$.

In the particular problem under consideration, it happens that there also exist
least favorable distributions $\Lambda$ and $\Lambda'$, which are true probability distributions
and therefore not invariant. These distributions can be obtained by an
examination of the corresponding one-sided problem in Section 3.9, as follows. On $\omega$,
where the only variable is $\xi$, the distribution $\Lambda$ of $\xi$ is taken as the normal
distribution with an arbitrary mean $\xi_1$ and with variance $(\sigma_2^2 - \sigma_0^2)/n$. Under $\Lambda'$ all
probability should be concentrated on the two lines $\sigma = \sigma_1$ and $\sigma = \sigma_2$ in the
$(\xi, \sigma)$ plane, and we put $\Lambda' = p\Lambda_1' + q\Lambda_2'$, where $\Lambda_1'$ is the normal distribution
with mean $\xi_1$ and variance $(\sigma_2^2 - \sigma_1^2)/n$, while $\Lambda_2'$ assigns probability 1 to the
point $(\xi_1, \sigma_2)$. A computation analogous to that carried out in Section 3.9 then
shows the acceptance region to be given by

$$\frac{\dfrac{p}{\sigma_1^{n-1}\sigma_2}\exp\left[-\dfrac{1}{2\sigma_1^2}\sum(x_i - \bar x)^2 - \dfrac{n}{2\sigma_2^2}(\bar x - \xi_1)^2\right] + \dfrac{q}{\sigma_2^{n}}\exp\left[-\dfrac{1}{2\sigma_2^2}\left\{\sum(x_i - \bar x)^2 + n(\bar x - \xi_1)^2\right\}\right]}{\dfrac{1}{\sigma_0^{n-1}\sigma_2}\exp\left[-\dfrac{1}{2\sigma_0^2}\sum(x_i - \bar x)^2 - \dfrac{n}{2\sigma_2^2}(\bar x - \xi_1)^2\right]} \le C,$$

which is equivalent to

$$C_1 \le \sum(x_i - \bar x)^2 \le C_2.$$

The probability of this inequality is independent of $\xi$, and hence $C_1$ and $C_2$ can
be determined so that the probability of acceptance is $1 - \alpha$ when $\sigma = \sigma_0$, and
is equal for the two values $\sigma = \sigma_1$ and $\sigma = \sigma_2$.

It follows from Section 3.7 that there exist $p$ and $C$ which lead to these values
of $C_1$ and $C_2$ and that the above test satisfies the conditions of Corollary 8.1.1
with $\omega_0 = \omega$, and with $\omega_0'$ consisting of the two lines $\sigma = \sigma_1$ and $\sigma = \sigma_2$. ■

8.3 Comparing Two Approximate Hypotheses


As in Section 3.2, let $P_0 \ne P_1$ be two distributions possessing densities $p_0$ and
$p_1$ with respect to a measure $\mu$. Since distributions even at best are known only
approximately, let us assume that the true distributions are approximately $P_0$ or
$P_1$ in the sense that they lie in one of the families

$$\mathcal{P}_i = \{Q : Q = (1 - \epsilon_i)P_i + \epsilon_i G_i\}, \quad i = 0, 1, \tag{8.10}$$

with $\epsilon_0, \epsilon_1$ given and the $G_i$ arbitrary unknown distributions. We wish to find
the level-$\alpha$ test of the hypothesis $H$ that the true distribution lies in $\mathcal{P}_0$, which
maximizes the minimum power over $\mathcal{P}_1$. This is the problem considered in Section
8.1 with $\theta$ indicating the true distribution, $\Omega_H = \mathcal{P}_0$, and $\Omega_K = \mathcal{P}_1$.

The following theorem shows the existence of a pair of least favorable
distributions $\Lambda$ and $\Lambda'$ satisfying the conditions of Theorem 8.1.1, each assigning
probability 1 to a single distribution, $\Lambda$ to $Q_0 \in \mathcal{P}_0$ and $\Lambda'$ to $Q_1 \in \mathcal{P}_1$, and
exhibits the $Q_i$ explicitly.


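Theorem 8.3.1 Let

$$q_0(x) = \begin{cases} (1 - \epsilon_0)\,p_0(x) & \text{if } p_1(x)/p_0(x) < b, \\ (1/b)(1 - \epsilon_0)\,p_1(x) & \text{if } p_1(x)/p_0(x) \ge b, \end{cases}$$

$$q_1(x) = \begin{cases} (1 - \epsilon_1)\,p_1(x) & \text{if } p_1(x)/p_0(x) > a, \\ a(1 - \epsilon_1)\,p_0(x) & \text{if } p_1(x)/p_0(x) \le a. \end{cases} \tag{8.11}$$

(i) For all $0 < \epsilon_i < 1$, there exist unique constants $a$ and $b$ such that $q_0$ and $q_1$
are probability densities with respect to $\mu$; the resulting $q_i$ are members of
$\mathcal{P}_i$ ($i = 0, 1$).

(ii) There exist $\delta_0, \delta_1$ such that for all $\epsilon_i \le \delta_i$ the constants $a$ and $b$ satisfy $a < b$
and that the resulting $q_0$ and $q_1$ are distinct.

(iii) If $\epsilon_i \le \delta_i$ for $i = 0, 1$, the families $\mathcal{P}_0$ and $\mathcal{P}_1$ are nonoverlapping and the
pair $(q_0, q_1)$ is least favorable, so that the maximin test of $\mathcal{P}_0$ against $\mathcal{P}_1$
rejects when $q_1(x)/q_0(x)$ is sufficiently large.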


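Note. Suppose $a < b$, and let

$$r(x) = \frac{p_1(x)}{p_0(x)}, \qquad r^*(x) = \frac{q_1(x)}{q_0(x)}, \qquad \text{and} \quad k = \frac{1 - \epsilon_1}{1 - \epsilon_0}.$$

Then

$$r^*(x) = \begin{cases} ka & \text{when } r(x) \le a, \\ kr(x) & \text{when } a < r(x) < b, \\ kb & \text{when } b \le r(x). \end{cases} \tag{8.12}$$

The maximin test thus replaces the original probability ratio with a censored
version.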
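
The constants $a$ and $b$ in (8.11) are fixed by requiring $q_1$ and $q_0$ to integrate to 1, which gives $P_1[r(X) > a] + aP_0[r(X) \le a] = 1/(1 - \epsilon_1)$ and $P_0[r(X) < b] + b^{-1}P_1[r(X) \ge b] = 1/(1 - \epsilon_0)$. The second equation is the first with the roles of $p_0$ and $p_1$ exchanged and the unknown replaced by $1/b$. The sketch below solves both by bisection on a finite sample space; the three-point distributions, the contamination levels, and all names are hypothetical choices made for illustration, and $p_0$, $p_1$ are assumed positive everywhere.

```python
def gamma(c, p0, p1):
    """gamma(c) = P1[r(X) > c] + c * P0[r(X) <= c], with r = p1/p0."""
    upper = sum(v for u, v in zip(p0, p1) if v / u > c)
    lower = sum(u for u, v in zip(p0, p1) if v / u <= c)
    return upper + c * lower


def solve_lower_constant(p0, p1, eps, tol=1e-12):
    """Root of gamma(c) = 1/(1 - eps); gamma is continuous and nondecreasing."""
    target = 1.0 / (1.0 - eps)
    lo, hi = 0.0, 1.0
    while gamma(hi, p0, p1) < target:  # bracket the root from above
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma(mid, p0, p1) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p0 = [0.5, 0.3, 0.2]
p1 = [0.2, 0.3, 0.5]
eps0 = eps1 = 0.1

a = solve_lower_constant(p0, p1, eps1)
b = 1.0 / solve_lower_constant(p1, p0, eps0)  # p0 and p1 exchanged

# Least favorable densities of (8.11) for this example.
r = [v / u for u, v in zip(p0, p1)]
q0_dens = [(1 - eps0) * (u if ri < b else v / b) for u, v, ri in zip(p0, p1, r)]
q1_dens = [(1 - eps1) * (v if ri > a else a * u) for u, v, ri in zip(p0, p1, r)]
print(a, b, sum(q0_dens), sum(q1_dens))
```

In this example the equations can also be solved by hand: on $0.4 \le c < 1$ one has $\gamma(c) = 0.8 + 0.5c$, so $a = 28/45$, and by the same calculation $b = 45/28$, so that $a < b$ as required in part (ii).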


Proof. The proof will be given under the simplifying assumption that $p_0(x)$ and
$p_1(x)$ are positive for all $x$ in the sample space.

(i): For $q_1$ to be a probability density, $a$ must satisfy the equation

$$P_1[r(X) > a] + aP_0[r(X) \le a] = \frac{1}{1 - \epsilon_1}. \tag{8.13}$$

If (8.13) holds, it is easily checked that $q_1 \in \mathcal{P}_1$ (Problem 8.12). To prove existence
and uniqueness of a solution $a$ of (8.13), let

$$\gamma(c) = P_1[r(X) > c] + cP_0[r(X) \le c].$$

Then

$$\gamma(0) = 1 \quad \text{and} \quad \gamma(c) \to \infty \text{ as } c \to \infty. \tag{8.14}$$

Furthermore (Problem 8.14)

$$\gamma(c + \Delta) - \gamma(c) = \Delta\int_{r(x) \le c} p_0(x)\,d\mu(x) + \int_{c < r(x) \le c + \Delta} [c + \Delta - r(x)]\,p_0(x)\,d\mu(x). \tag{8.15}$$

It follows from (8.15) that $0 \le \gamma(c + \Delta) - \gamma(c) \le \Delta$, so that $\gamma$ is continuous and
nondecreasing. Together with (8.14) this establishes the existence of a solution.
To prove uniqueness, note that

$$\gamma(c + \Delta) - \gamma(c) \ge \Delta\int_{r(x) \le c} p_0(x)\,d\mu(x) \tag{8.16}$$

and that $\gamma(c) = 1$ for all $c$ for which

$$P_i[r(x) \le c] = 0 \quad (i = 0, 1). \tag{8.17}$$

If $c_0$ is the supremum of the values for which (8.17) holds, (8.16) shows that $\gamma$
is strictly increasing for $c > c_0$ and this proves uniqueness. The proof for $b$ is
exactly analogous (Problem 8.13).

(ii): As $\epsilon_1 \to 0$, the solution $a$ of (8.13) tends to $c_0$. Analogously, as $\epsilon_0 \to 0$,
$b \to \infty$ (Problem 8.13).

(iii): This will follow from the following facts:

(a) When $X$ is distributed according to a distribution in $\mathcal{P}_0$, the statistic $r^*(X)$
is stochastically largest when the distribution of $X$ is $Q_0$.

(b) When $X$ is distributed according to a distribution in $\mathcal{P}_1$, $r^*(X)$ is
stochastically smallest for $Q_1$.

(c) $r^*(X)$ is stochastically larger when the distribution of $X$ is $Q_1$ than when
it is $Q_0$.

These statements are summarized in the inequalities

$$Q_0'[r^*(X) < t] \ge Q_0[r^*(X) < t] \ge Q_1[r^*(X) < t] \ge Q_1'[r^*(X) < t] \tag{8.18}$$

for all $t$ and all $Q_i' \in \mathcal{P}_i$.

From (8.12), it is seen that (8.18) is obvious when $t \le ka$ or $t > kb$. Suppose
therefore that $ak < t \le bk$, and denote the event $r^*(X) < t$ by $E$. Then $Q_0'(E) \ge
(1 - \epsilon_0)P_0(E)$ by (8.10). But $r^*(x) < t \le kb$ implies $r(x) < b$ and hence $Q_0(E) =
(1 - \epsilon_0)P_0(E)$. Thus $Q_0'(E) \ge Q_0(E)$, and analogously $Q_1'(E) \le Q_1(E)$. Finally,
the middle inequality of (8.18) follows from Corollary 3.2.1.

If the $\epsilon$'s are sufficiently small so that $Q_0 \ne Q_1$, it follows from (a)-(c) that
$\mathcal{P}_0$ and $\mathcal{P}_1$ are nonoverlapping.

That $(Q_0, Q_1)$ is least favorable and the associated test $\varphi$ is maximin now
follows from Theorem 8.1.1, since the most powerful test $\varphi$ for testing $Q_0$ against
$Q_1$ is a nondecreasing function of $q_1(X)/q_0(X)$. This shows that $E\varphi(X)$ takes on
its sup over $\mathcal{P}_0$ at $Q_0$ and its inf over $\mathcal{P}_1$ at $Q_1$, and this completes the proof. ■

Generalizations of this theorem are given by Huber and Strassen (1973, 1974). 
See also Rieder (1977) and Bednarski (1984). An optimum permutation test, with 
generalizations to the case of unknown location and scale parameters, is discussed 
by Lambert (1985). 

When the data consist of $n$ identically, independently distributed random
variables $X_1, \ldots, X_n$, the neighborhoods (8.10) may not be appropriate, since they
do not preserve the assumption of independence. If $P_i$ has density

$$p_i(x_1, \ldots, x_n) = f_i(x_1) \cdots f_i(x_n) \quad (i = 0, 1), \tag{8.19}$$

a more appropriate model approximating (8.19) may then assign to $X =
(X_1, \ldots, X_n)$ the family $\mathcal{P}_i^*$ of distributions according to which the $X_j$ are
independently distributed, each with distribution

$$(1 - \epsilon_i)F_i(x_j) + \epsilon_i G_i(x_j), \tag{8.20}$$

where $F_i$ has density $f_i$ and where as before the $G_i$ are arbitrary.


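Corollary 8.3.1 Suppose $q_0$ and $q_1$ defined by (8.11) with $x = x_j$ satisfy
(8.18) and hence are a least favorable pair for testing $\mathcal{P}_0$ against $\mathcal{P}_1$ on the
basis of the single observation $X_j$. Then the pair of distributions with densities
$q_i(x_1) \cdots q_i(x_n)$ ($i = 0, 1$) is least favorable for testing $\mathcal{P}_0^*$ against $\mathcal{P}_1^*$, so that the
maximin test is given by

$$\varphi(x_1, \ldots, x_n) = \begin{cases} 1 \\ \gamma \\ 0 \end{cases} \quad \text{as} \quad \prod_{j=1}^{n}\left[\frac{q_1(x_j)}{q_0(x_j)}\right] \ \begin{matrix} > \\ = \\ < \end{matrix} \ C. \tag{8.21}$$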


Proof. By assumption, the random variables $Y_j = q_1(X_j)/q_0(X_j)$ are
stochastically increasing as one moves successively from $Q_0' \in \mathcal{P}_0$ to $Q_0$ to $Q_1$ to $Q_1' \in \mathcal{P}_1$.
The same is then true of any function $\psi(Y_1, \ldots, Y_n)$ which is nondecreasing in
each of its arguments by Lemma 3.4.1, and hence of $\varphi$ defined by (8.21). The
proof now follows from Theorem 8.3.1. ■
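
To see how the censoring in (8.12) limits the influence of any single observation on the product in (8.21), consider the following sketch. It assumes, purely for illustration, that $f_0$ and $f_1$ are the $N(0,1)$ and $N(1,1)$ densities, so that $\log r(x) = x - \frac{1}{2}$, and it takes the constants $a$, $b$, $k$ as given inputs with hypothetical values rather than solving for them. The statistic is computed on the log scale to avoid overflow.

```python
import math


def log_censored_ratio(x, a, b, k):
    """log r*(x) as in (8.12), for f0 = N(0,1) and f1 = N(1,1) (our assumption)."""
    log_r = x - 0.5  # log[f1(x) / f0(x)]
    return math.log(k) + min(max(log_r, math.log(a)), math.log(b))


def log_censored_statistic(sample, a, b, k):
    """Logarithm of the product of the censored ratios over the sample."""
    return sum(log_censored_ratio(x, a, b, k) for x in sample)


a, b, k = 0.5, 2.0, 1.0  # hypothetical censoring constants
clean = [0.3, -0.2, 1.1, 0.7]
with_outlier = clean + [1000.0]
with_milder_outlier = clean + [50.0]

print(log_censored_statistic(clean, a, b, k))
print(log_censored_statistic(with_outlier, a, b, k))
```

Each observation contributes a term between $\log(ka)$ and $\log(kb)$, so a gross outlier shifts the statistic by at most $\log(kb)$, whereas it could dominate the uncensored likelihood ratio.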

Instead of the problem of testing $P_0$ against $P_1$, consider now the situation
of Lemma 8.2.1 where $H : \theta \le \theta_0$ is to be tested against $\theta \ge \theta_1$ ($\theta_0 < \theta_1$)
on the basis of $n$ independent observations $X_j$, each distributed according to a
distribution $F_\theta(x_j)$ whose density $f_\theta(x_j)$ is assumed to have monotone likelihood
ratio in $x_j$.

A robust version of this problem is obtained by replacing $F_\theta$ with

$$(1 - \epsilon)F_\theta(x_j) + \epsilon G(x_j), \quad j = 1, \ldots, n, \tag{8.22}$$

where $\epsilon$ is given and for each $\theta$ the distribution $G$ is arbitrary. Let $\mathcal{P}_0^{**}$ and $\mathcal{P}_1^{**}$
be the classes of distributions (8.22) with $\theta \le \theta_0$ and $\theta \ge \theta_1$ respectively; and
let $\mathcal{P}_0^*$ and $\mathcal{P}_1^*$ be defined as in Corollary 8.3.1 with $f_{\theta_i}$ in place of $f_i$. Then the
maximin test (8.21) of $\mathcal{P}_0^*$ against $\mathcal{P}_1^*$ retains this property for testing $\mathcal{P}_0^{**}$ against
$\mathcal{P}_1^{**}$.

This is proved in the same way as Corollary 8.3.1, using the additional fact
that if $F_{\theta'}$ is stochastically larger than $F_\theta$, then $(1 - \epsilon)F_{\theta'} + \epsilon G$ is stochastically
larger than $(1 - \epsilon)F_\theta + \epsilon G$.


8.4 Maximin Tests and Invariance 

When the problem of testing $\Omega_H$ against $\Omega_K$ remains invariant under a certain
group of transformations, it seems reasonable to expect the existence of an
invariant pair of least favorable distributions (or at least of sequences of distributions
which in some sense are least favorable and invariant in the limit), and hence
also of a maximin test which is invariant. This suggests the possibility of
bypassing the somewhat cumbersome approach of the preceding sections. If it could be
proved that for an invariant problem there always exists an invariant test that
maximizes the minimum power over $\Omega_K$, attention could be restricted to
invariant tests; in particular, a UMP invariant test would then automatically have
the desired maximin property (although it would not necessarily be admissible).
These speculations turn out to be correct for an important class of problems,
although unfortunately not in general. To find out under what conditions they
hold, it is convenient first to separate out the statistical aspects of the problem
from the group-theoretic ones by means of the following lemma.

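Lemma 8.4.1 Let $\mathcal{P} = \{P_\theta,\ \theta \in \Omega\}$ be a dominated family of distributions on
$(\mathcal{X}, \mathcal{A})$, and let $G$ be a group of transformations of $(\mathcal{X}, \mathcal{A})$, such that the induced
group $\bar G$ leaves the two subsets $\Omega_H$ and $\Omega_K$ of $\Omega$ invariant. Suppose that for any
critical function $\varphi$ there exists an (almost) invariant critical function $\psi$ satisfying

$$\inf_{\bar G} E_{\bar g\theta}\varphi(X) \le E_\theta\psi(X) \le \sup_{\bar G} E_{\bar g\theta}\varphi(X) \tag{8.23}$$

for all $\theta \in \Omega$. Then if there exists a level-$\alpha$ test $\varphi_0$ maximizing $\inf_{\Omega_K} E_\theta\varphi(X)$,
there also exists an (almost) invariant test with this property.

Proof. Let $\inf_{\Omega_K} E_\theta\varphi_0(X) = \beta$, and let $\psi_0$ be an (almost) invariant test such
that (8.23) holds with $\varphi = \varphi_0$, $\psi = \psi_0$. Then

$$E_\theta\psi_0(X) \le \sup_{\bar G} E_{\bar g\theta}\varphi_0(X) \le \alpha \quad \text{for all } \theta \in \Omega_H$$

and

$$E_\theta\psi_0(X) \ge \inf_{\bar G} E_{\bar g\theta}\varphi_0(X) \ge \beta \quad \text{for all } \theta \in \Omega_K,$$

as was to be proved. ■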

To determine conditions under which there exists an invariant or almost
invariant test $\psi$ satisfying (8.23), consider first the simplest case that $G$ is a finite
group, $G = \{g_1, \ldots, g_N\}$ say. If $\psi$ is then defined by

$$\psi(x) = \frac{1}{N}\sum_{i=1}^{N}\varphi(g_i x), \tag{8.24}$$

it is clear that $\psi$ is again a critical function, and that it is invariant under $G$. It
also satisfies (8.23), since $E_\theta\varphi(gX) = E_{\bar g\theta}\varphi(X)$ so that $E_\theta\psi(X)$ is the average of
a number of terms of which the first and last member of (8.23) are the minimum
and maximum respectively.

An illustration of the finite case is furnished by Example 8.2.3. Here the
problem remains invariant under the $n!$ permutations of the variables $(X_1, \ldots, X_n)$.
Lemma 8.4.1 is applicable and shows that there exists an invariant test
maximizing $\inf_{\Omega_K} E_\theta\varphi(X)$. Thus in particular the UMP invariant test obtained in
Example 6.3.5 has this maximin property and therefore constitutes a solution of
the problem.

It also follows that, under the setting of Theorem 6.3.1, the UMPI test given
by (6.9) is maximin.
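
The averaging construction (8.24) is easy to try out for the permutation group. The sketch below is a small illustration under our own assumptions: three independent Bernoulli variables, a deliberately non-invariant critical function depending on the first coordinate only, and hypothetical function names. It forms the symmetrized test and checks both its invariance and the bounds (8.23) by exact enumeration of the sample space.

```python
from itertools import permutations, product


def symmetrize(phi, n):
    """Average phi over all n! coordinate permutations, as in (8.24)."""
    perms = list(permutations(range(n)))

    def psi(x):
        return sum(phi(tuple(x[i] for i in g)) for g in perms) / len(perms)

    return psi


def expectation(test, p):
    """E_p test(X) for independent Bernoulli(p_i) coordinates, by enumeration."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        prob = 1.0
        for xi, pi in zip(x, p):
            prob *= pi if xi == 1 else 1.0 - pi
        total += prob * test(x)
    return total


def phi(x):
    return float(x[0])  # rejects when the first observation equals 1


n = 3
psi = symmetrize(phi, n)
p = (0.2, 0.5, 0.9)

# Expected value of phi under each permuted parameter point.
orbit = [expectation(phi, tuple(p[i] for i in g)) for g in permutations(range(n))]
power_psi = expectation(psi, p)
print(min(orbit), power_psi, max(orbit))
```

Here the symmetrized test is simply the sample mean of the coordinates, and its expectation $(0.2 + 0.5 + 0.9)/3$ lies between the smallest and largest of the permuted expectations of the original test, as (8.23) requires.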

The definition (8.24) suggests the possibility of obtaining $\psi(x)$ also in other
cases by averaging the values of $\varphi(gx)$ with respect to a suitable probability
distribution over the group $G$. To see what conditions would be required of this
distribution, let $\mathcal{B}$ be a $\sigma$-field of subsets of $G$ and $\nu$ a probability distribution over
$(G, \mathcal{B})$. Disregarding measurability problems for the moment, let $\psi$ be defined by

$$\psi(x) = \int \varphi(gx)\,d\nu(g). \tag{8.25}$$

Then $0 \le \psi \le 1$, and (8.23) is seen to hold by applying Fubini's theorem
(Theorem 2.2.4) to the integral of $\psi$ with respect to the distribution $P_\theta$. For any
$g_0 \in G$,

$$\psi(g_0 x) = \int \varphi(g g_0 x)\,d\nu(g) = \int \varphi(hx)\,d\nu^*(h),$$

where $h = g g_0$ and where $\nu^*$ is the measure defined by

$$\nu^*(B) = \nu(B g_0^{-1}) \quad \text{for all } B \in \mathcal{B},$$

into which $\nu$ is transformed by the transformation $h = g g_0$. Thus $\psi$ will have the
desired invariance property, $\psi(g_0 x) = \psi(x)$ for all $g_0 \in G$, if $\nu$ is right invariant,
that is, if it satisfies

$$\nu(Bg) = \nu(B) \quad \text{for all } B \in \mathcal{B},\ g \in G. \tag{8.26}$$

Such a condition was previously used in (6.16).

The measurability assumptions required for the above argument are: (i) For
any $A \in \mathcal{A}$, the set of pairs $(x, g)$ with $gx \in A$ is measurable $(\mathcal{A} \times \mathcal{B})$. This insures
that the function $\psi$ defined by (8.25) is again measurable. (ii) For any $B \in \mathcal{B}$,
$g \in G$, the set $Bg$ belongs to $\mathcal{B}$.

Example 8.4.1 If $G$ is a finite group with elements $g_1, \ldots, g_N$, let $\mathcal{B}$ be the class
of all subsets of $G$ and $\nu$ the probability measure assigning probability $1/N$ to
each of the $N$ elements. The condition (8.26) is then satisfied, and the definition
(8.25) of $\psi$ in this case reduces to (8.24). ■

Example 8.4.2 Consider the group $G$ of orthogonal $n \times n$ matrices $\Gamma$, with the
group product $\Gamma_1\Gamma_2$ defined as the corresponding matrix product. Each matrix
can be interpreted as the point in $n^2$-dimensional Euclidean space whose
coordinates are the $n^2$ elements of the matrix. The group then defines a subset of this
space; the Borel subsets of $G$ will be taken as the $\sigma$-field $\mathcal{B}$. To prove the existence
of a right invariant probability measure over $(G, \mathcal{B})$, we shall define a random
orthogonal matrix whose probability distribution satisfies (8.26) and is therefore
the required measure. With any nonsingular matrix $x = (x_{ij})$, associate the
orthogonal matrix $y = f(x)$ obtained by applying the following Gram-Schmidt
orthogonalization process to the $n$ row vectors $x_i = (x_{i1}, \ldots, x_{in})$ of $x$: $y_1$ is the
unit vector in the direction of $x_1$; $y_2$ the unit vector in the plane spanned by $x_1$
and $x_2$ which is orthogonal to $y_1$ and forms an acute angle with $x_2$; and so on.
Let $y = (y_{ij})$ be the matrix whose $i$th row is $y_i$.

Suppose now that the variables $X_{ij}$ ($i, j = 1, \ldots, n$) are independently
distributed as $N(0, 1)$, let $X$ denote the random matrix $(X_{ij})$, and let $Y = f(X)$.
To show that the distribution of the random orthogonal matrix $Y$ satisfies
(8.26), consider any fixed orthogonal matrix $\Gamma$ and any fixed set $B \in \mathcal{B}$. Then
$P\{Y \in B\Gamma\} = P\{Y\Gamma' \in B\}$ and from the definition of $f$ it is seen that
$Y\Gamma' = f(X\Gamma')$. Since the $n^2$ elements of the matrix $X\Gamma'$ have the same joint
distribution as those of the matrix $X$, the matrices $f(X\Gamma')$ and $f(X)$ also have
the same distribution, as was to be proved. ■
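
The construction of Example 8.4.2 can be carried out numerically. The following sketch is a minimal pure-Python version, with function names, seed, and dimension chosen by us. It orthonormalizes the rows of a Gaussian matrix by the Gram-Schmidt process and then checks, for one particular orthogonal matrix (the permutation that reverses the column order), the identity $f(X\Gamma') = f(X)\Gamma'$ on which the invariance argument rests.

```python
import math
import random


def gram_schmidt_rows(x):
    """Orthonormalize the rows of a nonsingular matrix in their given order."""
    ys = []
    for row in x:
        v = list(row)
        for y in ys:  # remove the components along the rows already built
            proj = sum(vi * yi for vi, yi in zip(v, y))
            v = [vi - proj * yi for vi, yi in zip(v, y)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        ys.append([vi / norm for vi in v])
    return ys


def gaussian_matrix(n, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


rng = random.Random(0)
x = gaussian_matrix(4, rng)
y = gram_schmidt_rows(x)

# Rows of y should be orthonormal: y y' = identity.
gram = [[sum(s * t for s, t in zip(ri, rj)) for rj in y] for ri in y]

# Right multiplication by a column-reversing permutation matrix.
x_reversed = [row[::-1] for row in x]
y_from_reversed = gram_schmidt_rows(x_reversed)
y_reversed = [row[::-1] for row in y]
```

The check with a permutation matrix only illustrates the equivariance for a single $\Gamma$; the general case follows because inner products between rows, on which the process depends, are unchanged by right multiplication with any orthogonal matrix.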

Examples 8.4.1 and 8.4.2 are sufficient for the applications to be made here. 
General conditions for the existence of an invariant probability measure, of which 
these examples are simple special cases, are given in the theory of Haar measure. 
[This is treated, for example, in the books by Halmos (1974), Loomis (1953), and 
Nachbin (1965). For a discussion in a statistical setting, see Eaton (1983, 1989), 
Farrell (1985a), and Wijsman (1990), and for a more elementary treatment Berger 
(1985a).] 


8.5 The Hunt-Stein Theorem 

Invariant measures exist (and are essentially unique) for a large class of groups,
but unfortunately they are frequently not finite and hence cannot be taken to be
probability measures. The situation is similar and related to that of the
nonexistence of a least favorable pair of distributions in Theorem 8.1.1. There it is usually
possible to overcome the difficulty by considering instead a sequence of
distributions which has the desired property in the limit. Analogously we shall now
generalize the construction of $\psi$ as an average with respect to a right-invariant
probability distribution, by considering a sequence of distributions over $G$ which
are approximately right-invariant for $n$ sufficiently large.

Let $\mathcal{P} = \{P_\theta,\ \theta \in \Omega\}$ be a family of distributions over a Euclidean space $(\mathcal{X}, \mathcal{A})$
dominated by a $\sigma$-finite measure $\mu$, and let $G$ be a group of transformations of
$(\mathcal{X}, \mathcal{A})$ such that the induced group $\bar G$ leaves $\Omega$ invariant.

Theorem 8.5.1 (Hunt-Stein.) Let $\mathcal{B}$ be a $\sigma$-field of subsets of $G$ such that for
any $A \in \mathcal{A}$ the set of pairs $(x, g)$ with $gx \in A$ is in $\mathcal{A} \times \mathcal{B}$ and for any $B \in \mathcal{B}$
and $g \in G$ the set $Bg$ is in $\mathcal{B}$. Suppose that there exists a sequence of probability
distributions $\nu_n$ over $(G, \mathcal{B})$ which is asymptotically right-invariant in the sense
that for any $g \in G$, $B \in \mathcal{B}$,

$$\lim_{n \to \infty} |\nu_n(Bg) - \nu_n(B)| = 0. \tag{8.27}$$

Then given any critical function $\varphi$, there exists a critical function $\psi$ which is
almost invariant and satisfies (8.23).


Proof. Let

$$\psi_n(x) = \int \varphi(gx)\,d\nu_n(g),$$

which as before is measurable and between 0 and 1. By the weak compactness
theorem (Theorem A.5.1 of the Appendix) there exists a subsequence $\{\psi_{n_i}\}$ and
a measurable function $\psi$ between 0 and 1 satisfying

$$\lim_{i \to \infty} \int \psi_{n_i}\,p\,d\mu = \int \psi\,p\,d\mu$$

for all $\mu$-integrable functions $p$, so that in particular

$$\lim_{i \to \infty} E_\theta\psi_{n_i}(X) = E_\theta\psi(X)$$

for all $\theta \in \Omega$. By Fubini's theorem,

$$E_\theta\psi_{n_i}(X) = \int [E_\theta\varphi(gX)]\,d\nu_{n_i}(g) = \int E_{\bar g\theta}\varphi(X)\,d\nu_{n_i}(g),$$

so that

$$\inf_{\bar G} E_{\bar g\theta}\varphi(X) \le E_\theta\psi_{n_i}(X) \le \sup_{\bar G} E_{\bar g\theta}\varphi(X),$$

and $\psi$ satisfies (8.23).

In order to prove that $\psi$ is almost invariant we shall show below that for all $x$
and $g$,

$$\psi_{n_i}(gx) - \psi_{n_i}(x) \to 0. \tag{8.28}$$

Let $I_A(x)$ denote the indicator function of a set $A \in \mathcal{A}$. Using the fact that
$I_{gA}(gx) = I_A(x)$, we see that (8.28) implies

$$\int_A \psi(x)\,dP_\theta(x) = \lim_{i \to \infty} \int \psi_{n_i}(x)\,I_A(x)\,dP_\theta(x)$$

$$= \lim_{i \to \infty} \int \psi_{n_i}(gx)\,I_{gA}(gx)\,dP_\theta(x)$$

$$= \int \psi(x)\,I_{gA}(x)\,dP_{\bar g\theta}(x) = \int_A \psi(gx)\,dP_\theta(x)$$

and hence $\psi(gx) = \psi(x)$ (a.e. $\mathcal{P}$), as was to be proved.

To prove (8.28), consider any fixed $x$ and any integer $m$, and let $G$ be
partitioned into the mutually exclusive sets

$$B_k = \Big\{h \in G : a_k < \varphi(hx) \le a_k + \frac{1}{m}\Big\}, \quad k = 0, \ldots, m,$$

where $a_k = (k - 1)/m$. In particular, $B_0$ is the set $\{h \in G : \varphi(hx) = 0\}$. It is seen
from the definition of the sets $B_k$ that

$$\sum_{k=0}^{m} a_k\,\nu_{n_i}(B_k) \le \sum_{k=0}^{m} \int_{B_k} \varphi(hx)\,d\nu_{n_i}(h) \le \sum_{k=0}^{m} \Big(a_k + \frac{1}{m}\Big)\nu_{n_i}(B_k) \le \sum_{k=0}^{m} a_k\,\nu_{n_i}(B_k) + \frac{1}{m},$$

and analogously that

$$\left|\sum_{k=0}^{m} \int_{B_k g^{-1}} \varphi(hgx)\,d\nu_{n_i}(h) - \sum_{k=0}^{m} a_k\,\nu_{n_i}(B_k g^{-1})\right| \le \frac{1}{m},$$

from which it follows that

$$|\psi_{n_i}(gx) - \psi_{n_i}(x)| \le \sum_{k=0}^{m} |a_k| \cdot |\nu_{n_i}(B_k g^{-1}) - \nu_{n_i}(B_k)| + \frac{2}{m}.$$

By (8.27) the first term of the right-hand side tends to zero as $i$ tends to infinity,
and this completes the proof. ■

When there exist a right-invariant measure $\nu$ over $G$ and a sequence of subsets
$G_n$ of $G$ with $G_n \subseteq G_{n+1}$, $\bigcup G_n = G$, and $\nu(G_n) = c_n < \infty$, it is suggestive
to take for the probability measures $\nu_n$ of Theorem 8.5.1 the measures $\nu/c_n$
truncated on $G_n$. This leads to the desired result in the example below. On the
other hand, there are cases in which there exists such a sequence of subsets
$G_n$ of $G$ but no invariant test satisfying (8.23) and hence no sequence $\nu_n$ satisfying
(8.27).


Example 8.5.1 Let $x = (x_1, \ldots, x_n)$, $\mathcal{A}$ be the class of Borel sets in $n$-space,
and $G$ the group of translations $(x_1 + g, \ldots, x_n + g)$, $-\infty < g < \infty$. The elements
of $G$ can be represented by the real numbers, and the group product $gg'$ is then
the sum $g + g'$. If $\mathcal{B}$ is the class of Borel sets on the real line, the measurability
assumptions of Theorem 8.5.1 are satisfied. Let $\nu$ be Lebesgue measure, which
is clearly invariant under $G$, and define $\nu_n$ to be the uniform distribution on the
interval $I(-n, n) = \{g : -n \le g \le n\}$. Then for all $B \in \mathcal{B}$, $g \in G$,

$$|\nu_n(B) - \nu_n(Bg)| = \frac{1}{2n} \bigl| \nu[B \cap I(-n, n)] - \nu[B \cap I(-n - g, n - g)] \bigr| \le \frac{|g|}{2n},$$

so that (8.27) is satisfied.

This argument also covers the group of scale transformations $(ax_1, \ldots, ax_n)$,
$0 < a < \infty$, which can be transformed into the translation group by taking
logarithms. ■
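
The bound of Example 8.5.1 can be checked numerically. The following is a minimal sketch, assuming that $B$ is a finite union of disjoint intervals; the helper names (`interval_overlap`, `nu_n`, `invariance_gap`) and the particular set $B$ and shift $g$ are illustrative choices, not part of the text.

```python
def interval_overlap(a, b, lo, hi):
    """Lebesgue measure of [a, b] intersected with [lo, hi]."""
    return max(0.0, min(b, hi) - max(a, lo))


def nu_n(intervals, n):
    """Uniform probability on I(-n, n) of a finite union of disjoint intervals."""
    return sum(interval_overlap(a, b, -n, n) for a, b in intervals) / (2.0 * n)


def translate(intervals, g):
    """The set Bg = {b + g : b in B} for the translation group."""
    return [(a + g, b + g) for a, b in intervals]


def invariance_gap(intervals, g, n):
    """|nu_n(B) - nu_n(Bg)|, the quantity bounded by |g| / (2n) in the example."""
    return abs(nu_n(intervals, n) - nu_n(translate(intervals, g), n))


B = [(-3.0, -1.0), (0.5, 2.0), (40.0, 75.0)]   # an arbitrary illustrative set
g = 7.5                                        # an arbitrary illustrative shift
gaps = [(n, invariance_gap(B, g, n), abs(g) / (2.0 * n))
        for n in (10, 100, 1000, 10000)]
for n, gap, bound in gaps:
    print(n, gap, bound)
```

Each printed gap should lie below the corresponding bound $|g|/2n$ and shrink as $n$ grows, which is the asymptotic right-invariance required by (8.27).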

When applying the Hunt-Stein theorem to obtain invariant minimax tests,
it is frequently convenient to carry out the calculation in steps, as was done in
Theorem 6.6.1. Suppose that the problem remains invariant under two groups $D$
and $E$, and denote by $y = s(x)$ a maximal invariant with respect to $D$ and by
$E^*$ the group defined in Theorem 6.2.2, which $E$ induces in $y$-space. If $D$ and
$E^*$ satisfy the conditions of the Hunt-Stein theorem, it follows first that there
exists a maximin test depending only on $y = s(x)$, and then that there exists a
maximin test depending only on a maximal invariant $z = t(y)$ under $E^*$.


Example 8.5.2 Consider a univariate linear hypothesis in the canonical form
in which $Y_1, \ldots, Y_n$ are independently distributed as $N(\eta_i, \sigma^2)$, where it is given
that $\eta_{s+1} = \cdots = \eta_n = 0$, and where the hypothesis to be tested is
$\eta_1 = \cdots = \eta_r = 0$. It was shown in Section 7.1 that this problem remains invariant under
certain groups of transformations and that with respect to these groups there
exists a UMP invariant test. The groups involved are the group of orthogonal
transformations, translation groups of the kind considered in Example 8.5.1, and
a group of scale changes. Since each of these satisfies the assumptions of the
Hunt-Stein theorem, and since they leave invariant the problem of maximizing
the minimum power over the set of alternatives

$$\sum_{i=1}^{r} \frac{\eta_i^2}{\sigma^2} \ge \psi_1^2 \qquad (\psi_1 > 0), \tag{8.29}$$

it follows that the UMP invariant test of Chapter 7 is also the solution of this
maximin problem. It is also seen slightly more generally that the test which is
UMP invariant under the same groups for testing

$$\sum_{i=1}^{r} \frac{\eta_i^2}{\sigma^2} \le \psi_0^2$$

(Problem 7.4) maximizes the minimum power over the alternatives (8.29) for
$\psi_0 < \psi_1$. ■


Example 8.5.3 (Stein) Let $G$ be the group of all nonsingular linear
transformations of $p$-space. That for $p > 1$ this does not satisfy the conditions of
Theorem 8.5.1 is shown by the following problem, which is invariant under $G$
but for which the UMP invariant test does not maximize the minimum power.
Generalizing Example 6.2.1, let $X = (X_1, \ldots, X_p)$, $Y = (Y_1, \ldots, Y_p)$ be
independently distributed according to $p$-variate normal distributions with zero means
and nonsingular covariance matrices $E(X_i X_j) = \sigma_{ij}$ and $E(Y_i Y_j) = \Delta \sigma_{ij}$, and
let $H : \Delta \le \Delta_0$ be tested against $\Delta \ge \Delta_1$ ($\Delta_0 < \Delta_1$), the $\sigma_{ij}$ being unknown.

This problem remains invariant if the two vectors are subjected to any common
nonsingular transformation, and since with probability 1 this group is transitive
over the sample space, the UMP invariant test is trivially $\varphi(x, y) \equiv \alpha$. The
maximin power against the alternatives $\Delta \ge \Delta_1$ that can be achieved by invariant
tests is therefore $\alpha$. On the other hand, the test with rejection region $Y_1^2/X_1^2 > C$
has a strictly increasing power function $\beta(\Delta)$, whose minimum over the set of
alternatives $\Delta \ge \Delta_1$ is $\beta(\Delta_1) > \beta(\Delta_0) = \alpha$. ■
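
The power function of the test $Y_1^2/X_1^2 > C$ can be written in closed form, which makes the comparison with the trivial invariant test concrete. The sketch below rests on one observation that is not spelled out in the example: $Y_1^2/X_1^2$ is distributed as $\Delta T^2$ with $T$ standard Cauchy, so that $P(T^2 > c) = 1 - (2/\pi)\arctan\sqrt{c}$. The function names and the numerical values of $\alpha$, $\Delta_0$, $\Delta_1$ are illustrative assumptions.

```python
import math


def critical_value(alpha, delta0):
    """C such that P(Y1^2 / X1^2 > C) = alpha when Delta = delta0."""
    return delta0 * math.tan(0.5 * math.pi * (1.0 - alpha)) ** 2


def power(delta, c):
    """beta(Delta) = P(Delta * T^2 > C) for a standard Cauchy variable T."""
    return 1.0 - (2.0 / math.pi) * math.atan(math.sqrt(c / delta))


alpha, delta0, delta1 = 0.05, 1.0, 4.0        # illustrative values only
C = critical_value(alpha, delta0)
print(C, power(delta0, C), power(delta1, C))
```

The second printed value reproduces $\alpha$, and the third exceeds it, in line with $\beta(\Delta_1) > \beta(\Delta_0) = \alpha$; no invariant test can do better than $\alpha$ in this problem.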

It is a remarkable feature of Theorem 8.5.1 that its assumptions concern only 
the group $G$ and not the distributions $P_\theta$.³ When these assumptions hold for a
certain G it follows from (8.23) as in the proof of Lemma 8.4.1 that for any testing 
problem which remains invariant under G and possesses a UMP invariant test, 
this test maximizes the minimum power over any invariant class of alternatives. 
Suppose conversely that a UMP invariant test under G has been shown in a 
particular problem not to maximize the minimum power, as was the case for 
the group of linear transformations in Example 8.5.3. Then the assumptions of 
Theorem 8.5.1 cannot be satisfied. However, this does not rule out the possibility 
that for another problem remaining invariant under G, the UMP invariant test 
may maximize the minimum power. Whether or not it does is no longer a property 
of the group alone but will in general depend also on the particular distributions. 


3 These assumptions are essentially equivalent to the condition that the group $G$ is
amenable. Amenability and its relationship to the Hunt-Stein theorem are discussed by
Bondar and Milnes (1982) and (with a different terminology) by Stone and von Randow
(1968).

Consider in particular the problem of testing $H : \xi_1 = \cdots = \xi_p = 0$ on
the basis of a sample $(X_{\alpha 1}, \ldots, X_{\alpha p})$, $\alpha = 1, \ldots, n$, from a $p$-variate normal
distribution with mean $E(X_{\alpha i}) = \xi_i$ and common covariance matrix
$(\sigma_{ij}) = (a_{ij})^{-1}$. This problem remains invariant under a number of groups, including
that of all nonsingular linear transformations of $p$-space, and a UMP invariant
test exists. An invariant class of alternatives under these groups is

$$\sum \sum a_{ij} \xi_i \xi_j \ge \psi_1^2. \tag{8.30}$$

Here, Theorem 8.5.1 is not applicable, and the question of whether the $T^2$-test
of $H : \psi = 0$ maximizes the minimum power over the alternatives

$$\sum \sum a_{ij} \xi_i \xi_j = \psi_1^2 \tag{8.31}$$

[and hence a fortiori over the alternatives (8.30)] presents formidable difficulties.
The minimax property was proved for the case $p = 2$, $n = 3$ by Giri, Kiefer, and
Stein (1963), for the case $p = 2$, $n = 4$ by Linnik, Pliss, and Salaevskii (1968),
and for $p = 2$ and all $n \ge 3$ by Salaevskii (1971). The proof is effected by first
reducing the problem through invariance under the group $G_1$ of Example 6.6.11,
to which Theorem 8.5.1 is applicable, and then applying Theorem 8.1.1 to the
reduced problem. It is a consequence of this approach that it also establishes
the admissibility of $T^2$ as a test of $H$ against the alternatives (8.31). In view of
the inadmissibility results for point estimation when $p \ge 3$ (see TPE2, Sections
5.4-5.5), it seems unlikely that $T^2$ is admissible for $p \ge 3$, and hence that the same
method can be used to prove the minimax property in this situation.

The problem becomes much easier when the minimax property is considered
against local or distant alternatives rather than against (8.31). Precise definitions
and proofs of the fact that $T^2$ possesses these properties for all $p$ and $n$ are
provided by Giri and Kiefer (1964) and in the references given in Section 7.9.

The theory of this and the preceding section can be extended to confidence
sets if the accuracy of a confidence set at level $1 - \alpha$ is assessed by its volume
or some other appropriate measure of its size. Suppose that the distribution of
$X$ depends on the parameters $\theta$ to be estimated and on nuisance parameters $\vartheta$,
and that $\mu$ is a $\sigma$-finite measure over the parameter set $\omega = \{\theta : (\theta, \vartheta) \in \Omega\}$,
with $\omega$ assumed to be independent of $\vartheta$. Then the confidence sets $S(X)$ for $\theta$ are
minimax with respect to $\mu$ at level $1 - \alpha$ if they minimize

$$\sup E_{\theta, \vartheta}\, \mu[S(X)]$$

among all confidence sets at the given level.

The problem of minimizing $E\mu[S(X)]$ is related to that of minimizing the
probability of covering false values (the criterion for accuracy used so far) by the
relation (Problem 8.34)

$$E_{\theta_0, \vartheta}\, \mu[S(X)] = \int P_{\theta_0, \vartheta}[\theta \in S(X)]\, d\mu(\theta), \tag{8.32}$$

which holds provided $\mu$ assigns measure zero to the set $\{\theta = \theta_0\}$. (For the special
case that $\theta$ is real-valued and $\mu$ Lebesgue measure, see Problem 5.26.)
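
Relation (8.32) can be verified numerically in the simplest case. The sketch below assumes a single observation $X \sim N(\theta_0, 1)$, no nuisance parameter, $\mu$ equal to Lebesgue measure, and the interval $S(X) = [X - z, X + z]$; the integration range and step count are arbitrary choices, and the function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def coverage_prob(theta, theta0, z):
    """P_{theta0}(theta in [X - z, X + z]) for X ~ N(theta0, 1)."""
    d = theta - theta0
    return STD_NORMAL.cdf(d + z) - STD_NORMAL.cdf(d - z)


def integrated_coverage(theta0, z, half_width=12.0, steps=24000):
    """Midpoint-rule approximation to the right-hand side of (8.32)."""
    h = 2.0 * half_width / steps
    lo = theta0 - half_width
    return h * sum(coverage_prob(lo + (i + 0.5) * h, theta0, z)
                   for i in range(steps))


z = STD_NORMAL.inv_cdf(0.975)       # half-length of the usual 95% interval
expected_length = 2.0 * z           # left-hand side: the length is nonrandom here
rhs = integrated_coverage(theta0=1.3, z=z)
print(expected_length, rhs)
```

The two printed numbers should agree to within the accuracy of the quadrature, illustrating that the expected length equals the integrated probability of covering the values $\theta$.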

Suppose now that the problem of estimating $\theta$ is invariant under a group $G$ in
the sense of Section 6.11 and that $\mu$ satisfies the invariance condition

$$\mu[S(gx)] = \mu[S(x)]. \tag{8.33}$$

If uniformly most accurate equivariant confidence sets exist, they minimize (8.32)
among all equivariant confidence sets at the given level, and one may hope that
under the assumptions of the Hunt-Stein theorem, they will also be minimax
with respect to $\mu$ among the class of all (not necessarily equivariant) confidence
sets at the given level. Such a result does hold and can be used to show for
example that the most accurate equivariant confidence sets of Examples 6.11.2
and 6.11.3 minimize their maximum expected Lebesgue measure. A more general
class of examples is provided by the confidence intervals derived from the UMP
invariant tests of univariate linear hypotheses such as the confidence spheres for
$\theta_i = \mu + \alpha_i$ or for $\alpha_i$ given in Section 7.4.

Minimax confidence sets $S(x)$ are not necessarily admissible; that is, there may
exist sets $S'(x)$ having the same confidence level but such that

$$E_{\theta, \vartheta}\, \mu[S'(X)] \le E_{\theta, \vartheta}\, \mu[S(X)] \quad \text{for all } \theta, \vartheta$$

with strict inequality holding for at least some $(\theta, \vartheta)$.


Example 8.5.4 Let $X_i$ ($i = 1, \ldots, s$) be independently normally distributed
with mean $E(X_i) = \theta_i$ and variance 1, and let $G$ be the group generated by
translations $X_i + c_i$ ($i = 1, \ldots, s$) and orthogonal transformations of $(X_1, \ldots, X_s)$.
($G$ is the Euclidean group of rigid motions in $s$-space.) In Example 6.12.2, it was
argued that the confidence sets

$$C_0 = \left\{ (\theta_1, \ldots, \theta_s) : \sum (\theta_i - X_i)^2 \le c \right\} \tag{8.34}$$

are uniformly most accurate equivariant. The volume $\mu[S(X)]$ of any confidence
set $S(X)$ remains invariant under the transformations $g \in G$, and it follows
from the results of Problems 8.26 and 8.4 and Examples 8.5.1 and 8.5.2 that the
confidence sets (8.34) minimize the maximum expected volume.

However, very surprisingly, they are not admissible unless $s = 1$ or 2. In the
case $s \ge 3$, Stein (1962) suggested the region (8.34) can be improved by recentered
regions of the form

$$C_1 = \left\{ (\theta_1, \ldots, \theta_s) : \sum (\theta_i - \delta X_i)^2 \le c \right\}, \tag{8.35}$$

where $\delta = \max\left(0,\, 1 - (s - 2)/\sum X_i^2\right)$. In fact, Brown (1966) proved that, for
$s \ge 3$,

$$P_\theta\{\theta \in C_1\} > P_\theta\{\theta \in C_0\}$$

for all $\theta$. This result, which will not be proved here, is closely related to the
inadmissibility of $X_1, \ldots, X_s$ as a point estimator of $(\theta_1, \ldots, \theta_s)$ for a wide variety
of loss functions. The work on point estimation, which is discussed in TPE2,
Sections 5.4-5.6, for squared error loss, provides easier access to these ideas than
the present setting. Further entries into the literature on admissibility are Stein
(1981), Hwang and Casella (1982), and Tseng and Brown (1997); additional
references are provided in TPE2, p. 423.

The inadmissibility of the confidence sets (8.34) is particularly surprising in
that the associated UMP invariant tests of the hypotheses $H : \theta_i = \theta_{i0}$
($i = 1, \ldots, s$) are admissible (Problems 8.24, 8.25). ■
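
A small simulation gives a feeling for the size of the improvement in Example 8.5.4. This is only an illustrative sketch and not a substitute for Brown's proof: the dimension $s = 5$, the constant $c = 9.236$ (roughly the 0.90 quantile of $\chi^2_5$), the two parameter points, the seed, and all function names are assumptions made for the illustration.

```python
import random


def shrink_factor(x):
    """Positive-part factor delta = max(0, 1 - (s - 2) / sum(x_i^2))."""
    s = len(x)
    ss = sum(v * v for v in x)
    return max(0.0, 1.0 - (s - 2) / ss)


def covers(center, theta, c):
    """True if sum((theta_i - center_i)^2) <= c."""
    return sum((t - m) ** 2 for t, m in zip(theta, center)) <= c


def coverage(theta, c, reps, rng):
    """Monte Carlo coverage of C0 (centered at X) and C1 (centered at delta * X)."""
    hits0 = hits1 = 0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        d = shrink_factor(x)
        hits0 += covers(x, theta, c)
        hits1 += covers([d * v for v in x], theta, c)
    return hits0 / reps, hits1 / reps


rng = random.Random(0)               # fixed seed so the run is reproducible
s, c = 5, 9.236                      # c is roughly the 0.90 quantile of chi-square(5)
cov0_origin, cov1_origin = coverage([0.0] * s, c, 20000, rng)
cov0_far, cov1_far = coverage([3.0] * s, c, 20000, rng)
print(cov0_origin, cov1_origin, cov0_far, cov1_far)
```

The sets $C_0$ and $C_1$ have the same volume, so any gain in coverage comes at no cost in size. The gain is largest near $\theta = 0$ and becomes small as $\sum \theta_i^2$ grows, since $\delta$ is then close to 1.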



8.6 Most Stringent Tests 




One of the practical difficulties in the consideration of tests that maximize the
minimum power over a class $\Omega_K$ of alternatives is the determination of an
appropriate $\Omega_K$. If no information is available on which to base the choice of this set,
and if a natural definition is not imposed by invariance arguments, a frequently
reasonable definition can be given in terms of the power that can be achieved
against the various alternatives. The envelope power function $\beta^*_\alpha$ was defined in
Problem 6.25 by

$$\beta^*_\alpha(\theta) = \sup \beta_\varphi(\theta),$$

where $\beta_\varphi$ denotes the power of a test $\varphi$ and where the supremum is taken over
all level-$\alpha$ tests of $H$. Thus $\beta^*_\alpha(\theta)$ is the maximum power that can be attained
at level $\alpha$ against the alternative $\theta$. (That it can be attained follows under mild
restrictions from Theorem A.5.1 of the Appendix.) If

$$S^*_\Delta = \{\theta : \beta^*_\alpha(\theta) = \Delta\},$$

then of two alternatives $\theta_1 \in S^*_{\Delta_1}$, $\theta_2 \in S^*_{\Delta_2}$, $\theta_1$ can be considered closer to $H$,
equidistant, or further away than $\theta_2$ as $\Delta_1$ is $<$, $=$, or $>$ $\Delta_2$.

The idea of measuring the distance of an alternative from $H$ in terms of the
available information has been encountered before. If for example $X_1, \ldots, X_n$ is a
sample from $N(\xi, \sigma^2)$, the problem of testing $H : \xi \le 0$ was discussed (Section 5.2)
both when the alternatives $\xi$ are measured in absolute units and when they are
measured in $\sigma$-units. The latter possibility corresponds to the present proposal,
since it follows from invariance considerations (Problem 6.25) that $\beta^*_\alpha(\xi, \sigma)$ is
constant on the lines $\xi/\sigma = $ constant.

Fixing a value of $\Delta$ and taking as $\Omega_K$ the class of alternatives $\theta$ for which
$\beta^*_\alpha(\theta) \ge \Delta$, one can determine the test that maximizes the minimum power over
$\Omega_K$. Another possibility, which eliminates the need of selecting a value of $\Delta$, is
to consider for any test $\varphi$ the difference $\beta^*_\alpha(\theta) - \beta_\varphi(\theta)$. This difference measures
the amount by which the actual power $\beta_\varphi(\theta)$ falls short of the maximum power
attainable. A test that minimizes

$$\sup_{\Omega - \omega} [\beta^*_\alpha(\theta) - \beta_\varphi(\theta)] \tag{8.36}$$

is said to be most stringent. Thus a test is most stringent if it minimizes its
maximum shortcoming.

Let $\varphi_\Delta$ be a test that maximizes the minimum power over $S^*_\Delta$, and hence
minimizes the maximum difference between $\beta^*_\alpha(\theta)$ and $\beta_\varphi(\theta)$ over $S^*_\Delta$. If $\varphi_\Delta$
happens to be independent of $\Delta$, it is most stringent. This remark makes it
possible to apply the results of the preceding sections to the determination of
most stringent tests. Suppose that the problem of testing $H : \theta \in \omega$ against
the alternatives $\theta \in \Omega - \omega$ remains invariant under a group $G$, that there exists
a UMP almost invariant test $\varphi_0$ with respect to $G$, and that the assumptions
of Theorem 8.5.1 hold. Since $\beta^*_\alpha(\theta)$ and hence the set $S^*_\Delta$ is invariant under $\bar G$
(Problem 6.25), it follows that $\varphi_0$ maximizes the minimum power over $S^*_\Delta$ for
each $\Delta$, and $\varphi_0$ is therefore most stringent.

As an example of this method consider the problem of testing $H : p_1, \ldots, p_n \le \frac{1}{2}$
against the alternative $K : p_i > \frac{1}{2}$ for all $i$, where $p_i$ is the probability of success
in the $i$th trial of a sequence of $n$ independent trials. If $X_i$ is 1 or 0 as the $i$th trial
is a success or failure, then the problem remains invariant under permutations of
the $X$'s, and the UMP invariant test rejects (Example 6.3.5) when $\sum X_i > C$. It
now follows from the remarks above that this test is also most stringent.
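
The constants of the test based on $\sum X_i$ are easily computed. The sketch below assumes that the level is attained at $p_1 = \cdots = p_n = \frac{1}{2}$ and that randomization on the boundary $\sum X_i = C$ with probability `gamma` is used to achieve size exactly $\alpha$; the randomization and the function names are assumptions of the illustration, since the text states only the rejection region $\sum X_i > C$.

```python
from math import comb


def binom_half_pmf(n, k):
    """P(sum X_i = k) when every p_i = 1/2."""
    return comb(n, k) / 2.0 ** n


def critical_values(n, alpha):
    """Smallest C with P(sum > C) <= alpha, and gamma making the size exactly alpha."""
    tail = 0.0
    for c in range(n, -1, -1):
        pk = binom_half_pmf(n, c)
        if tail + pk > alpha:
            # reject if sum > c; reject with probability gamma if sum == c
            return c, (alpha - tail) / pk
        tail += pk
    return -1, 0.0


def power(n, c, gamma, p):
    """Power when all p_i equal p (a convenience; the p_i need not be equal)."""
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    return sum(pmf[c + 1:]) + gamma * pmf[c]


n, alpha = 20, 0.05                 # illustrative values only
C, gamma = critical_values(n, alpha)
print(C, gamma, power(n, C, gamma, 0.5), power(n, C, gamma, 0.7))
```

The third printed value reproduces $\alpha$, and the fourth gives the power at the particular alternative $p_1 = \cdots = p_n = 0.7$.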

Another illustration is furnished by the general univariate linear hypothesis.
Here it follows from the discussion in Example 8.5.2 that the standard test for
testing $H : \eta_1 = \cdots = \eta_r = 0$ or $H' : \sum \eta_i^2/\sigma^2 \le \psi_0^2$ is most stringent.

When the invariance approach is not applicable, the explicit determination of
most stringent tests typically is difficult. The following is a class of problems for
which they are easily obtained by a direct approach. Let the distributions of $X$
constitute a one-parameter exponential family, the density of which is given by
(3.19), and consider the hypothesis $H : \theta = \theta_0$. Then according as $\theta > \theta_0$ or
$\theta < \theta_0$, the envelope power $\beta^*_\alpha(\theta)$ is the power of the UMP one-sided test for
testing $H$ against $\theta > \theta_0$ or $\theta < \theta_0$. Suppose that there exists a two-sided test $\varphi_0$
given by (4.3), such that

$$\sup_{\theta < \theta_0} [\beta^*_\alpha(\theta) - \beta_{\varphi_0}(\theta)] = \sup_{\theta > \theta_0} [\beta^*_\alpha(\theta) - \beta_{\varphi_0}(\theta)], \tag{8.37}$$

and that the supremum is attained on both sides, say at points $\theta_1 < \theta_0 < \theta_2$.
If $\beta_{\varphi_0}(\theta_i) = \beta_i$, $i = 1, 2$, an application of the fundamental lemma
[Theorem 3.6.1(iii)] to the three points $\theta_1, \theta_2, \theta_0$ shows that among all tests $\varphi$ with
$\beta_\varphi(\theta_1) \ge \beta_1$ and $\beta_\varphi(\theta_2) \ge \beta_2$, only $\varphi_0$ satisfies $\beta_\varphi(\theta_0) \le \alpha$. For any other level-$\alpha$
test, therefore, either $\beta_\varphi(\theta_1) < \beta_1$ or $\beta_\varphi(\theta_2) < \beta_2$, and it follows that $\varphi_0$ is the
unique most stringent test. The existence of a test satisfying (8.37) can be proved
by a continuity consideration [with respect to variation of the constants $C_i$ and
$\gamma_i$ which define the boundary of the test (4.3)] from the fact that for the UMP
one-sided test against the alternatives $\theta > \theta_0$ the right-hand side of (8.37) is
zero and the left-hand side positive, while the situation is reversed for the other
one-sided test.
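
Condition (8.37) and the continuity argument can be illustrated in the normal case. The sketch below assumes a single observation $X \sim N(\theta, 1)$ and $\theta_0 = 0$, for which the equal-tailed two-sided test satisfies (8.37) by symmetry; suprema are approximated on a finite grid, and the function names and grid are illustrative assumptions.

```python
from statistics import NormalDist

PHI = NormalDist().cdf
QUANTILE = NormalDist().inv_cdf


def envelope_power(theta, alpha):
    """Power of the UMP one-sided level-alpha test on the side of theta."""
    return 1.0 - PHI(QUANTILE(1.0 - alpha) - abs(theta))


def two_sided_power(theta, c_lower, c_upper):
    """Power of the test rejecting when X < c_lower or X > c_upper, X ~ N(theta, 1)."""
    return PHI(c_lower - theta) + 1.0 - PHI(c_upper - theta)


def max_shortcoming(thetas, alpha, c_lower, c_upper):
    """Grid approximation to sup [beta*(theta) - beta_phi(theta)]."""
    return max(envelope_power(t, alpha) - two_sided_power(t, c_lower, c_upper)
               for t in thetas)


alpha = 0.05
grid_pos = [0.01 * k for k in range(1, 801)]
grid_neg = [-t for t in grid_pos]

c = QUANTILE(1.0 - alpha / 2.0)                 # equal-tailed two-sided test
left = max_shortcoming(grid_neg, alpha, -c, c)
right = max_shortcoming(grid_pos, alpha, -c, c)

c1 = QUANTILE(1.0 - alpha)                      # one-sided test against theta > 0
one_sided_left = max_shortcoming(grid_neg, alpha, float("-inf"), c1)
one_sided_right = max_shortcoming(grid_pos, alpha, float("-inf"), c1)
print(left, right, one_sided_left, one_sided_right)
```

For the equal-tailed test the two suprema coincide, as (8.37) requires. For the one-sided test the right-hand supremum vanishes while the left-hand one is large, which is the starting point of the continuity argument.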


8.7 Problems 

Section 8.1 

Problem 8.1 Existence of maximin tests.⁴ Let $(\mathcal{X}, \mathcal{A})$ be a Euclidean sample
space, and let the distributions $P_\theta$, $\theta \in \Omega$, be dominated by a $\sigma$-finite measure
over $(\mathcal{X}, \mathcal{A})$. For any mutually exclusive subsets $\Omega_H$, $\Omega_K$ of $\Omega$ there exists a level-$\alpha$
test maximizing (8.2).

[Let $\beta = \sup[\inf_{\Omega_K} E_\theta \varphi(X)]$, where the supremum is taken over all level-$\alpha$ tests
of $H : \theta \in \Omega_H$. Let $\varphi_n$ be a sequence of level-$\alpha$ tests such that $\inf_{\Omega_K} E_\theta \varphi_n(X)$
tends to $\beta$. If $\varphi_{n_i}$ is a subsequence and $\varphi$ a test (guaranteed by Theorem A.5.1
of the Appendix) such that $E_\theta \varphi_{n_i}(X)$ tends to $E_\theta \varphi(X)$ for all $\theta \in \Omega$, then $\varphi$ is
a level-$\alpha$ test and $\inf_{\Omega_K} E_\theta \varphi(X) = \beta$.]


4 The existence of maximin tests is established in considerable generality in Cvitanic
and Karatzas (2001).

Problem 8.2 Locally most powerful tests.⁵ Let $d$ be a measure of the distance
of an alternative $\theta$ from a given hypothesis $H$. A level-$\alpha$ test $\varphi_0$ is said to be
locally most powerful (LMP) if, given any other level-$\alpha$ test $\varphi$, there exists $\Delta$
such that

$$\beta_{\varphi_0}(\theta) \ge \beta_\varphi(\theta) \quad \text{for all } \theta \text{ with } 0 < d(\theta) < \Delta. \tag{8.38}$$

Suppose that $\theta$ is real-valued and that the power function of every test is
continuously differentiable at $\theta_0$.

(i) If there exists a unique level-$\alpha$ test $\varphi_0$ of $H : \theta = \theta_0$, maximizing $\beta'_\varphi(\theta_0)$,
then $\varphi_0$ is the unique LMP level-$\alpha$ test of $H$ against $\theta > \theta_0$ for $d(\theta) = \theta - \theta_0$.

(ii) To see that (i) is not correct without the uniqueness assumption, let $X$ take
on the values 0 and 1 with probabilities $P_\theta(0) = \frac{1}{2} - \theta^3$, $P_\theta(1) = \frac{1}{2} + \theta^3$,
$-\frac{1}{2} < \theta^3 < \frac{1}{2}$, and consider testing $H : \theta = 0$ against $K : \theta > 0$. Then
every test $\varphi$ of size $\alpha$ maximizes $\beta'_\varphi(0)$, but not every such test is LMP.
[Kallenberg et al. (1984).]

(iii) The following⁶ is another counterexample to (i) without uniqueness, in
which in fact no LMP test exists. Let $X$ take on the values 0, 1, 2 with
probabilities

$$P_\theta(x) = \alpha + \epsilon \left[ \theta + \theta^2 \sin\left( \frac{x}{\theta} \right) \right] \quad \text{for } x = 1, 2,$$

$$P_\theta(0) = 1 - P_\theta(1) - P_\theta(2),$$

where $-1 \le \theta \le 1$ and $\epsilon$ is a sufficiently small number. Then a test $\varphi$ at
level $\alpha$ maximizes $\beta'(0)$ provided

$$\varphi(1) + \varphi(2) = 1,$$

but no LMP test exists.


(iv) A unique LMP test maximizes the minimum power locally provided its
power function is bounded away from $\alpha$ for every set of alternatives which
is bounded away from $H$.

(v) Let $X_1, \ldots, X_n$ be a sample from a Cauchy distribution with unknown
location parameter $\theta$, so that the joint density of the $X$'s is
$\pi^{-n} \prod_{i=1}^{n} [1 + (x_i - \theta)^2]^{-1}$. The LMP test for testing $\theta = 0$ against $\theta > 0$ at level $\alpha < \frac{1}{2}$
is not unbiased and hence does not maximize the minimum power locally.

[(iii): The unique most powerful test against $\theta$ is given by $\varphi(1) = 1$ if
$\sin(1/\theta) > \sin(2/\theta)$ and by $\varphi(2) = 1$ if $\sin(1/\theta) < \sin(2/\theta)$,
and each of these inequalities holds at values of $\theta$ arbitrarily close to 0.
(v): There exists $M$ so large that any point with $x_i \ge M$ for all $i = 1, \ldots, n$
lies in the acceptance region of the LMP test. Hence the power of the test
tends to zero as $\theta$ tends to infinity.]


5 Locally optimal tests for multiparameter hypotheses are given in Gupta and
Vermeire (1986).

6 Due to John Pratt.

Problem 8.3 A level-$\alpha$ test $\varphi_0$ is locally unbiased (loc. unb.) if there exists
$\Delta_0 > 0$ such that $\beta_{\varphi_0}(\theta) \ge \alpha$ for all $\theta$ with $0 < d(\theta) < \Delta_0$; it is LMP loc. unb. if
it is loc. unb. and if, given any other loc. unb. level-$\alpha$ test $\varphi$, there exists $\Delta$ such
that (8.38) holds. Suppose that $\theta$ is real-valued and that $d(\theta) = |\theta - \theta_0|$, and that
the power function of every test is twice continuously differentiable at $\theta = \theta_0$.

(i) If there exists a unique test $\varphi_0$ of $H : \theta = \theta_0$ against $K : \theta \ne \theta_0$ which
among all loc. unb. tests maximizes $\beta''(\theta_0)$, then $\varphi_0$ is the unique LMP
loc. unb. level-$\alpha$ test of $H$ against $K$.

(ii) The test of part (i) maximizes the minimum power locally provided its
power function is bounded away from $\alpha$ for every set of alternatives that
is bounded away from $H$.

[(ii): A necessary condition for a test to be locally minimax is that it is loc. unb.]

Problem 8.4 Locally uniformly most powerful tests. If the sample space is finite
and independent of $\theta$, the test $\varphi_0$ of Problem 8.2(i) is not only LMP but also
locally uniformly most powerful (LUMP) in the sense that there exists a value
$\Delta > 0$ such that $\varphi_0$ maximizes $\beta_\varphi(\theta)$ for all $\theta$ with $0 < \theta - \theta_0 < \Delta$.

[See the argument following (6.21) of Section 6.9.]

Problem 8.5 The following two examples show that the assumption of a finite
sample space is needed in Problem 8.4.

(i) Let $X_1, \ldots, X_n$ be i.i.d. according to a normal distribution $N(\sigma, \sigma^2)$ and
test $H : \sigma = \sigma_0$ against $K : \sigma > \sigma_0$.

(ii) Let $X$ and $Y$ be independent Poisson variables with $E(X) = \lambda$ and
$E(Y) = \lambda + 1$, and test $H : \lambda = \lambda_0$ against $K : \lambda > \lambda_0$.

In each case, determine the LMP test and show that it is not LUMP.

[Compare the LMP test with the most powerful test against a simple alternative.]


Section 8.2 

Problem 8.6 Let the distribution of $X$ depend on the parameters
$(\theta, \vartheta) = (\theta_1, \ldots, \theta_r, \vartheta_1, \ldots, \vartheta_s)$. A test of $H : \theta = \theta^0$ is locally strictly unbiased if for
each $\vartheta$, (a) $\beta_\varphi(\theta^0, \vartheta) = \alpha$, (b) there exists a $\theta$-neighborhood of $\theta^0$ in which
$\beta_\varphi(\theta, \vartheta) > \alpha$ for $\theta \ne \theta^0$.

(i) Suppose that the first and second derivatives

$$\beta^i_\varphi(\vartheta) = \left. \frac{\partial}{\partial \theta_i} \beta_\varphi(\theta, \vartheta) \right|_{\theta^0} \quad \text{and} \quad \beta^{ij}_\varphi(\vartheta) = \left. \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \beta_\varphi(\theta, \vartheta) \right|_{\theta^0}$$

exist for all critical functions $\varphi$ and all $\vartheta$. Then a necessary and sufficient
condition for $\varphi$ to be locally strictly unbiased is that $\beta^i_\varphi = 0$ for all $i$ and
$\vartheta$, and that the matrix $(\beta^{ij}_\varphi(\vartheta))$ is positive definite for all $\vartheta$.

(ii) A test of $H$ is said to be of type E (type D if $s = 0$ so that there are no
nuisance parameters) if it is locally strictly unbiased and among all tests
with this property maximizes the determinant $|(\beta^{ij}_\varphi)|$.⁷ (This determinant
under the stated conditions turns out to be equal to the Gaussian curvature
of the power surface at $\theta^0$.) Then the test $\varphi_0$ given by (7.7) for testing the
general linear univariate hypothesis (7.3) is of type E.


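[(ii): With $\theta = (\eta_1, \ldots, \eta_r)$ and $\vartheta = (\eta_{r+1}, \ldots, \eta_s, \sigma)$, the test $\varphi_0$, by Problem 7.5,
has the property of maximizing the surface integral

$$\int_S [\beta_\varphi(\eta, \sigma^2) - \alpha]\, dA$$

among all similar (and hence all locally unbiased) tests where
$S = \{(\eta_1, \ldots, \eta_r) : \sum_{i=1}^{r} \eta_i^2 = \rho^2 \sigma^2\}$. Letting $\rho$ tend to zero and utilizing the conditions

$$\beta^i_\varphi(\vartheta) = 0, \qquad \int_S \eta_i \eta_j\, dA = 0 \ \text{ for } i \ne j, \qquad \int_S \eta_i^2\, dA = k(\rho\sigma),$$

one finds that $\varphi_0$ maximizes $\sum_{i=1}^{r} \beta^{ii}_\varphi(\eta, \sigma^2)$ among all locally unbiased tests.

Since for any positive definite matrix, $|(\beta^{ij}_\varphi)| \le \prod \beta^{ii}_\varphi$, it follows that for any
locally strictly unbiased test $\varphi$,

$$|(\beta^{ij}_\varphi)| \le \prod \beta^{ii}_\varphi \le \left[ \frac{\sum \beta^{ii}_\varphi}{r} \right]^r \le \left[ \frac{\sum \beta^{ii}_{\varphi_0}}{r} \right]^r = [\beta^{11}_{\varphi_0}]^r = |(\beta^{ij}_{\varphi_0})|.]$$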


Problem 8.7 Let $Z_1, \ldots, Z_n$ be identically independently distributed according
to a continuous distribution $D$, of which it is assumed only that it is symmetric
about some (unknown) point. For testing the hypothesis $H : D(0) = \frac{1}{2}$, the sign
test maximizes the minimum power against the alternatives $K : D(0) \le q$ ($q < \frac{1}{2}$).
[A pair of least favorable distributions assign probability 1 respectively to the
distributions $F \in H$, $G \in K$ with densities

$$f(x) = \frac{1 - 2q}{2(1 - q)} \left( \frac{q}{1 - q} \right)^{[|x|]}, \qquad g(x) = (1 - 2q) \left( \frac{q}{1 - q} \right)^{|[x]|},$$

where for all $x$ (positive, negative, or zero) $[x]$ denotes the largest integer $\le x$.]


Problem 8.8 Let $f_\theta(x) = \theta g(x) + (1 - \theta) h(x)$ with $0 \le \theta \le 1$. Then $f_\theta(x)$
satisfies the assumptions of Lemma 8.2.1 provided $g(x)/h(x)$ is a nondecreasing
function of $x$.


Problem 8.9 Let $x = (x_1, \ldots, x_n)$, and let $g_\theta(x, \xi)$ be a family of probability
densities depending on $\theta = (\theta_1, \ldots, \theta_r)$ and the real parameter $\xi$, and jointly
measurable in $x$ and $\xi$. For each $\theta$, let $h_\theta(\xi)$ be a probability density with respect
to a $\sigma$-finite measure $\nu$ such that $p_\theta(x) = \int g_\theta(x, \xi) h_\theta(\xi)\, d\nu(\xi)$ exists. We shall
say that a function $f$ of two arguments $u = (u_1, \ldots, u_r)$, $v = (v_1, \ldots, v_s)$ is
nondecreasing in $(u, v)$ if $f(u', v)/f(u, v) \le f(u', v')/f(u, v')$ for all $(u, v)$ satisfying
$u_i \le u'_i$, $v_j \le v'_j$ ($i = 1, \ldots, r$; $j = 1, \ldots, s$). Then $p_\theta(x)$ is nondecreasing in $(x, \theta)$
provided the product $g_\theta(x, \xi) h_\theta(\xi)$ is (a) nondecreasing in $(x, \theta)$ for each fixed $\xi$;
(b) nondecreasing in $(\theta, \xi)$ for each fixed $x$; (c) nondecreasing in $(x, \xi)$ for each
fixed $\theta$.

[Interpreting $g_\theta(x, \xi)$ as the conditional density of $x$ given $\xi$, and $h_\theta(\xi)$ as the a
priori density of $\xi$, let $\rho(\xi)$ denote the a posteriori density of $\xi$ given $x$, and let
$\rho'(\xi)$ be defined analogously with $\theta'$ in place of $\theta$. That $p_\theta(x)$ is nondecreasing in
its two arguments is equivalent to

$$\int \frac{g_\theta(x', \xi)}{g_\theta(x, \xi)}\, \rho(\xi)\, d\nu(\xi) \le \int \frac{g_{\theta'}(x', \xi)}{g_{\theta'}(x, \xi)}\, \rho'(\xi)\, d\nu(\xi).$$

By (a) it is enough to prove that

$$D = \int \frac{g_\theta(x', \xi)}{g_\theta(x, \xi)}\, [\rho'(\xi) - \rho(\xi)]\, d\nu(\xi) \ge 0.$$

Let $S_- = \{\xi : \rho'(\xi)/\rho(\xi) < 1\}$ and $S_+ = \{\xi : \rho'(\xi)/\rho(\xi) \ge 1\}$. By (b) the set $S_-$
lies entirely to the left of $S_+$. It follows from (c) that there exists $a \le b$ such that

$$D = a \int_{S_-} [\rho'(\xi) - \rho(\xi)]\, d\nu(\xi) + b \int_{S_+} [\rho'(\xi) - \rho(\xi)]\, d\nu(\xi),$$

and hence that $D = (b - a) \int_{S_+} [\rho'(\xi) - \rho(\xi)]\, d\nu(\xi) \ge 0$.]


7 An interesting example of a type-D test is provided by Cohen and Sackrowitz (1975),
who show that the $\chi^2$-test of Chapter 14.3 has this property. Type D and E tests were
introduced by Isaacson (1951).


Problem 8.10 (i) Let $X$ have binomial distribution $b(p, n)$, and consider
testing $H : p = p_0$ at level $\alpha$ against the alternatives $\Omega_K : p/q \le \frac{1}{2} p_0/q_0$ or
$\ge 2p_0/q_0$. For $\alpha = .05$ determine the smallest sample size for which there
exists a test with power $\ge .8$ against $\Omega_K$ if $p_0 = .1, .2, .3, .4, .5$.

(ii) Let $X_1, \ldots, X_n$ be independently distributed as $N(\xi, \sigma^2)$. For testing $\sigma = 1$
at level $\alpha = .05$, determine the smallest sample size for which there exists
a test with power $\ge .9$ against the alternatives $\sigma^2 \le \frac{1}{2}$ and $\sigma^2 \ge 2$.

[See Problem 4.5.]


Problem 8.11 Double-exponential distribution. Let $X_1, \ldots, X_n$ be a sample
from the double-exponential distribution with density $\frac{1}{2} e^{-|x - \theta|}$. The LMP test
for testing $\theta \le 0$ against $\theta > 0$ is the sign test, provided the level is of the form

$$\alpha = \frac{1}{2^n} \sum_{k=0}^{m} \binom{n}{k},$$

so that the level-$\alpha$ sign test is nonrandomized.

[Let $R_k$ ($k = 0, \ldots, n$) be the subset of the sample space in which $k$ of the $X$'s
are positive and $n - k$ are negative. Let $0 \le k < l \le n$, and let $S_k$, $S_l$ be subsets
of $R_k$, $R_l$ such that $P_0(S_k) = P_0(S_l) \ne 0$. Then it follows from a consideration
of $P_\theta(S_k)$ and $P_\theta(S_l)$ for small $\theta$ that there exists $\Delta$ such that $P_\theta(S_k) < P_\theta(S_l)$
for $0 < \theta < \Delta$. Suppose now that the rejection region of a nonrandomized test
of $\theta = 0$ against $\theta > 0$ does not consist of the upper tail of a sign test. Then it
can be converted into a sign test of the same size by a finite number of steps,
each of which consists in replacing an $S_k$ by an $S_l$ with $k < l$, and each of which
therefore increases the power for $\theta$ sufficiently small.]

Section 8.3

Problem 8.12 If (8.13) holds, show that $q_1$ defined by (8.11) belongs to $\mathcal{P}_1$.

Problem 8.13 Show that there exists a unique constant $b$ for which $q_0$ defined
by (8.11) is a probability density with respect to $\mu$, that the resulting $q_0$ belongs
to $\mathcal{P}_0$, and that $b \to \infty$ as $\epsilon_0 \to 0$.

Problem 8.14 Prove the formula (8.15).

Problem 8.15 Show that if $\mathcal{P}_0 \ne \mathcal{P}_1$ and $\epsilon_0$, $\epsilon_1$ are sufficiently small, then
$Q_0 \ne Q_1$.

Problem 8.16 Evaluate the test (8.21) explicitly for the case that $P_i$ is the
normal distribution with mean $\xi_i$ and known variance $\sigma^2$, and when $\epsilon_0 = \epsilon_1$.

Problem 8.17 Determine whether (8.21) remains the maximin test if in the
model (8.20) $G_i$ is replaced by $G_{ij}$.

Problem 8.18 Write out a formal proof of the maximin property outlined in
the last paragraph of Section 8.3.


Section 8.4 

Problem 8.19 Let $X_1, \ldots, X_n$ be independently normally distributed with
means $E(X_i) = \mu_i$ and variance 1. The test of $H : \mu_1 = \cdots = \mu_n = 0$ that
maximizes the minimum power over $\omega' : \sum \mu_i \ge d$ rejects when $\sum X_i \ge C$.

[If the least favorable distribution assigns probability 1 to a single point,
invariance under permutations suggests that this point will be
$\mu_1 = \cdots = \mu_n = d/n$.]

Problem 8.20⁸ (i) In the preceding problem determine the maximin test if
$\omega'$ is replaced by $\sum a_i \mu_i \ge d$, where the $a$'s are given positive constants.

(ii) Solve part (i) with $\mathrm{Var}(X_i) = 1$ replaced by $\mathrm{Var}(X_i) = \sigma_i^2$ (known).

[(i): Determine the point $(\mu_1^*, \ldots, \mu_n^*)$ in $\omega'$ for which the MP test of $H$ against
$K : (\mu_1^*, \ldots, \mu_n^*)$ has the smallest power, and show that the MP test of $H$ against
$K$ is a maximin solution.]

Problem 8.21 Let $X_1, \ldots, X_n$ be independent normal variables with variance
1 and means $\xi_1, \ldots, \xi_n$, and consider the problem of testing
$H : \xi_1 = \cdots = \xi_n = 0$ against the alternatives $K = \{K_1, \ldots, K_n\}$, where $K_i : \xi_j = 0$ for $j \ne i$,
$\xi_i = \xi$ (known and positive). Show that the problem remains invariant under
permutation of the $X$'s and that there exists a UMP invariant test $\varphi_0$ which
rejects when $\sum e^{\xi X_j} > C$, by the following two methods.

(i) The order statistics $X_{(1)} \le \cdots \le X_{(n)}$ constitute a maximal invariant.

(ii) Let $f_0$ and $f_i$ denote the densities under $H$ and $K_i$ respectively. Then the
level-$\alpha$ test $\varphi_0$ of $H$ vs. $K' : f = (1/n) \sum f_i$ is UMP invariant for testing
$H$ vs. $K$.

[(ii): If $\varphi_0$ is not UMP invariant for $H$ vs. $K$, there exists an invariant test $\varphi_1$
whose (constant) power against $K$ exceeds that of $\varphi_0$. Then $\varphi_1$ is also more
powerful against $K'$.]


8 Due to Fritz Scholz.


Problem 8.22 The UMP invariant test $\varphi_0$ of Problem 8.21

(i) maximizes the minimum power over $K$;

(ii) is admissible.

(iii) For testing the hypothesis $H$ of Problem 8.21 against the alternatives
$K' = \{K_1, \ldots, K_n, K'_1, \ldots, K'_n\}$, where under $K'_i : \xi_j = 0$ for all $j \ne i$, $\xi_i = -\xi$,
determine the UMP test under a suitable group $G'$, and show that it is
both maximin and invariant.

[(ii): Suppose $\varphi'$ is uniformly at least as powerful as $\varphi_0$, and more powerful for at
least one $K_i$, and let

$$\varphi^*(x_1, \ldots, x_n) = \frac{\sum \varphi'(x_{i_1}, \ldots, x_{i_n})}{n!},$$

where the summation extends over all permutations. Then $\varphi^*$ is invariant, and
its power is independent of $i$ and exceeds that of $\varphi_0$.]


Problem 8.23 For testing $H : f_0$ against $K : \{f_1, \ldots, f_s\}$, suppose there exists
a finite group $G = \{g_1, \ldots, g_N\}$ which leaves $H$ and $K$ invariant and which is
transitive in the sense that given $f_j$, $f_{j'}$ ($1 \le j, j'$) there exists $g \in G$ such that
$\bar g f_j = f_{j'}$. In generalization of Problems 8.21, 8.22, determine a UMP invariant
test, and show that it is both maximin against $K$ and admissible.


Problem 8.24 To generalize the results of the preceding problem to the testing
of $H : f$ vs. $K : \{f_\theta, \theta \in \omega\}$, assume:

(i) There exists a group $G$ that leaves $H$ and $K$ invariant.

(ii) $\bar G$ is transitive over $\omega$.

(iii) There exists a probability distribution $Q$ over $G$ which is right-invariant in
the sense of Section 8.4.

Determine a UMP invariant test, and show that it is both maximin against $K$
and admissible.


Problem 8.25 Let Xi, ..., X n be independent normal with means 6 \, ..., 9 n 
and variance 1. 

(i) Apply the results of the preceding problem to the testing of II : 0\ — • • • _ 
9 n = 0 against K : X) 9\ = r 2 , for any fixed r > 0. 

(ii) Show that the results of (i) remain valid if H and K are replaced by 
H’ : E 9l < r 2 0 , A" : £ 0? > rj (r„ < n). 

Problem 8.26 Suppose in Problem 8.25(i) the variance $\sigma^2$ is unknown and that
the data consist of $X_1,\ldots,X_n$ together with an independent random variable $S^2$
for which $S^2/\sigma^2$ has a $\chi^2$-distribution. If $K$ is replaced by $\sum \theta_i^2/\sigma^2 = r^2$, then

(i) the confidence sets $\sum (\theta_i - X_i)^2/S^2 \le C$ are uniformly most accurate
equivariant under the group generated by the $n$-dimensional generalization
of the group $G_0$ of Example 6.11.2, and the scale changes $X_i' = cX_i$, $S'^2 =
c^2 S^2$.

(ii) The confidence sets of (i) are minimax with respect to the measure $\mu$ given
by

$$\mu[C(X, S^2)] = \frac{1}{\sigma^n}\,[\text{volume of } C(X, S^2)].$$

[Use polar coordinates with $\theta^2 = \sum \theta_i^2$.]


Section 8.5 

Problem 8.27 Let $X = (X_1,\ldots,X_p)$ and $Y = (Y_1,\ldots,Y_p)$ be independently
distributed according to $p$-variate normal distributions with zero means and
covariance matrices $E(X_iX_j) = \sigma_{ij}$ and $E(Y_iY_j) = \Delta\sigma_{ij}$.

(i) The problem of testing $H : \Delta \le \Delta_0$ remains invariant under the group $G$ of
transformations $X^* = XA$, $Y^* = YA$, where $A = (a_{ij})$ is any nonsingular
$p \times p$ matrix with $a_{ij} = 0$ for $i > j$, and there exists a UMP invariant test
under $G$ with rejection region $Y_1^2/X_1^2 > C$.

(ii) The test with rejection region $Y_1^2/X_1^2 > C$ maximizes the minimum power
for testing $\Delta \le \Delta_0$ against $\Delta \ge \Delta_1$ $(\Delta_0 < \Delta_1)$.

[(ii): That the Hunt-Stein theorem is applicable to $G$ can be proved in steps
by considering the group $G_q$ of transformations $X_q' = \alpha_1 X_1 + \cdots + \alpha_q X_q$,
$X_i' = X_i$ for $i = 1,\ldots,q-1,q+1,\ldots,p$, successively for $q = 1,\ldots,p-1$.
Here $\alpha_q \ne 0$, since the matrix $A$ is nonsingular if and only if
$a_{ii} \ne 0$ for all $i$. The group product $(\gamma_1,\ldots,\gamma_q)$ of two such transformations
$(\alpha_1,\ldots,\alpha_q)$ and $(\beta_1,\ldots,\beta_q)$ is given by $\gamma_1 = \alpha_1\beta_q + \beta_1$, $\gamma_2 = \alpha_2\beta_q + \beta_2$, $\ldots$,
$\gamma_{q-1} = \alpha_{q-1}\beta_q + \beta_{q-1}$, $\gamma_q = \alpha_q\beta_q$, which shows $G_q$ to be isomorphic
to a group of scale changes (multiplication of all components by $\beta_q$) and
translations [addition of $(\beta_1,\ldots,\beta_{q-1},0)$]. The result now follows from the
Hunt-Stein theorem and Example 8.5.1, since the assumptions of the Hunt-Stein
theorem, except for the easily verifiable measurability conditions,
concern only the abstract structure $(G,\mathcal{B})$, and not the specific realization
of the elements of $G$ as transformations of some space.]

Problem 8.28 Suppose that the problem of testing $\theta \in \Omega_H$ against $\theta \in \Omega_K$
remains invariant under $G$, that there exists a UMP almost invariant test $\phi_0$
with respect to $G$, and that the assumptions of Theorem 8.5.1 hold. Then $\phi_0$
maximizes $\inf_{\Omega_K}[w(\theta)E_\theta\phi(X) + u(\theta)]$ for any weight functions $w(\theta) \ge 0$, $u(\theta)$
that are invariant under $G$.

Problem 8.29 Suppose $X$ has the multivariate normal distribution in $\mathbb{R}^k$ with
unknown mean vector $h$ and known positive definite covariance matrix $C^{-1}$.
Consider testing $h = 0$ versus $|C^{1/2}h| \ge b$ for some $b > 0$, where $|\cdot|$ denotes the
Euclidean norm.

(i) Show the test that rejects when $|C^{1/2}X|^2 > c_{k,1-\alpha}$ is maximin, where $c_{k,1-\alpha}$
denotes the $1 - \alpha$ quantile of the Chi-squared distribution with $k$ degrees of
freedom.

(ii) Show that the maximin power of the above test is given by
$P\{\chi_k^2(b^2) > c_{k,1-\alpha}\}$,
where $\chi_k^2(b^2)$ denotes a random variable that has the noncentral Chi-squared
distribution with $k$ degrees of freedom and noncentrality parameter $b^2$.


Problem 8.30 Suppose $X_1,\ldots,X_k$ are independent, with $X_i \sim N(\theta_i, 1)$.
Consider testing the null hypothesis $\theta_1 = \cdots = \theta_k = 0$ against $\max_i |\theta_i| \ge \delta$, for some
$\delta > 0$. Find a maximin level $\alpha$ test as explicitly as possible. Compare this test
with the maximin test if the alternative parameter space were $\sum \theta_i^2 \ge \delta^2$. Argue
they are quite similar for small $\delta$. Specifically, consider the power of each test
against $(\delta, 0,\ldots,0)$ and show that it is equal to $\alpha + C_\alpha\delta^2 + o(\delta^2)$ as $\delta \to 0$, and
the constant $C_\alpha$ is the same for both tests.


Section 8.6 

Problem 8.31 Existence of most stringent tests. Under the assumptions of
Problem 8.1 there exists a most stringent test for testing $\theta \in \Omega_H$ against
$\theta \in \Omega - \Omega_H$.


Problem 8.32 Let $\{\Omega_\Delta\}$ be a class of mutually exclusive sets of alternatives
such that the envelope power function is constant over each $\Omega_\Delta$ and that
$\bigcup \Omega_\Delta = \Omega - \Omega_H$, and let $\phi_\Delta$ maximize the minimum power over $\Omega_\Delta$. If $\phi_\Delta = \phi$
is independent of $\Delta$, then $\phi$ is most stringent for testing $\theta \in \Omega_H$.


Problem 8.33 Let $(Z_1,\ldots,Z_N) = (X_1,\ldots,X_m,Y_1,\ldots,Y_n)$ be distributed
according to the joint density (5.55), and consider the problem of testing $H : \eta = \xi$
against the alternatives that the $X$’s and $Y$’s are independently normally
distributed with common variance $\sigma^2$ and means $\eta \ne \xi$. Then the permutation test
with rejection region $|\bar Y - \bar X| > C[T(Z)]$, the two-sided version of the test (5.54),
is most stringent.

[Apply Problem 8.32 with each of the sets $\Omega_\Delta$ consisting of two points $(\xi_1,\eta_1,\sigma)$,
$(\xi_2,\eta_2,\sigma)$ such that

$$\xi_1 = \zeta - \frac{n}{m+n}\,\delta, \qquad \eta_1 = \zeta + \frac{m}{m+n}\,\delta;$$

$$\xi_2 = \zeta + \frac{n}{m+n}\,\delta, \qquad \eta_2 = \zeta - \frac{m}{m+n}\,\delta$$

for some $\zeta$ and $\delta$.]


Problem 8.34 Show that the UMP invariant test of Problem 8.21 is most 
stringent. 

8.8 Notes

The concepts and results of Section 8.1 are essentially contained in the minimax 
theory developed by Wald for general decision problems. An exposition of this 
theory and some of its applications is given in Wald’s book (1950). For more 
recent assessments of the important role of the minimax approach, see Brown 
(1994, 2000). The ideas of Section 8.3, and in particular Theorem 8.3.1, are 
due to Huber (1965) and form the core of his theory of robust tests [Huber 
(1981, Chapter 10)]. The material of sections 8.4 and 8.5, including Lemma 8.4.1, 
Theorem 8.5.1, and Example 8.5.2, constitutes the main part of an unpublished 
paper of Hunt and Stein (1946). 



9 

Multiple Testing and Simultaneous Inference


9.1 Introduction and the FWER 

When testing more than one parameter, say

$$H: \theta_1 = \cdots = \theta_s = 0 \tag{9.1}$$

against the alternatives that one or more of the $\theta$’s are positive, it is typically not
enough simply to accept or reject $H$. In case of acceptance, nothing more is
required: the finding is that none of the parameter values are significant. However,
when $H$ is rejected, one will in most cases want to know just which of the
parameters $\theta_i$ are significant. And when $H$ is tested against the two-sided alternatives
that one or more of the $\theta$’s are different from 0, one would in case of rejection
usually want to know the signs of the significant $\theta$’s.¹

Example 9.1.1 (Normal one-sample problem) Suppose that $X_1,\ldots,X_n$ is
a sample from $N(\xi, \sigma^2)$ and consider the hypothesis $H: \xi \le \xi_0$, $\sigma \le \sigma_0$. In case of
rejection one would want to know whether it is the mean or the variance that is
rejected, or perhaps both. ■

Example 9.1.2 (Comparing several treatments with a control) When
testing several treatments against a control, the overall null hypothesis states that
none of the treatments is an improvement over, or differs from, the control. In case
of rejection one will wish to know just which of the treatments show a significant
difference. ■


¹We shall here disregard this latter issue, but see Comment 2 at the end of Section
9.3.

Example 9.1.3 (Testing equality of several treatments) Instead of
comparing several treatments with a control, one may wish to compare a number
of possible alternative situations with each other. If the quality of the $i$th of $s$
alternatives is measured by a parameter $\theta_i$, the hypothesis is

$$H: \theta_1 = \cdots = \theta_s . \tag{9.2}$$ ■

Since most multiple testing problems, like those in Examples 9.1.2 and 9.1.3, 
are concerned with multiple comparisons, the whole subject of multiple testing is
frequently, and somewhat inaccurately, called multiple comparisons. 

When comparing several medical, agricultural, or industrial treatments, the
number of treatments is typically fairly small, say, in the single digits. Larger
numbers occur in some educational studies, where for example it may be desired
to compare performance in the 50 U.S. states. A fairly recent application of
multiple comparison theory occurs in microarrays where thousands or even tens 
of thousands of genes are tested simultaneously. Each microarray corresponds to 
one unit (plant, animal or person) and in these experiments the sample size (the 
number of such units) is typically of a much smaller order of magnitude (in the 
tens) than the number of comparisons being tested. 

Let us now consider the general problem of simultaneously testing a finite
number of hypotheses $H_i$ ($i = 1,\ldots,s$). We shall assume that tests for the
individual hypotheses are available and the problem is how to combine them into 
a simultaneous test procedure. 

The easiest approach is to disregard the multiplicity and simply test each
hypothesis at level $\alpha$. However, with such a procedure the probability of one or more
false rejections rapidly increases with $s$. When the number of true hypotheses is
large, we shall be nearly certain to reject some of them. To get a numerical idea
of this phenomenon, the following table shows (to 2 decimals) the probability
of one or more false rejections when all of the hypotheses $H_1,\ldots,H_s$ are true,
when the test statistics used for testing $H_1,\ldots,H_s$ are independent, and when
the level at which each of the $s$ hypotheses is tested is $\alpha = .05$.


s                                  1     2     5     10    50
P(at least one false rejection)    .05   .10   .23   .40   .92


In this sense the claim that the procedure controls the probability of false 
rejections at level .05 is clearly very misleading. 
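
The entries of the table follow from independence: when all $s$ hypotheses are true and each is tested at level $\alpha$, the probability of at least one false rejection is $1 - (1-\alpha)^s$. A minimal Python sketch that recomputes these probabilities (the function name is ours, chosen for illustration):

```python
def prob_at_least_one_false_rejection(s, alpha=0.05):
    """Probability of one or more false rejections among s true hypotheses,
    each tested at level alpha with independent test statistics."""
    return 1.0 - (1.0 - alpha) ** s


if __name__ == "__main__":
    for s in (1, 2, 5, 10, 50):
        print(s, prob_at_least_one_false_rejection(s))
```
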

We shall therefore in the present chapter replace the usual condition for testing
a single hypothesis, that the probability of a false rejection not exceed $\alpha$, by the
requirement, when testing several hypotheses, that the probability of one or more
false rejections not exceed a given level. This probability is called the family-wise
error rate (FWER). Here the term “family” refers to the collection of hypotheses
$H_1,\ldots,H_s$ that is being considered for joint testing. In a laboratory testing blood
samples, this might be all the tests performed in a day, or those performed in a 
day by a given tester. Alternatively, the tests given in the morning and afternoon 
might be considered as separate families, and so on. Which tests are to be treated 
jointly as a family depends on the situation. 

Once the family has been defined, we shall require that

$$\mathrm{FWER} \le \alpha \tag{9.3}$$

for all possible constellations of true and false hypotheses. This is sometimes 
called strong error control to distinguish it from the much weaker (and typically 
not very meaningful) condition of weak control which requires (9.3) to hold only 
when all the hypotheses of the family are true. 

Methods that control the FWER are often described by the p-values of the
individual tests, which were introduced in Section 3.2. We now present two simple
methods that control the FWER which can be stated easily in terms of p-values.
Each hypothesis $H_i$ can be viewed as a subset, $\omega_i$, of $\Omega$. Assume that $\hat p_i$ is a
p-value for testing $H_i$; specifically, we assume

$$P\{\hat p_i \le u\} \le u \tag{9.4}$$

for any $u \in (0,1)$ and any $P \in \omega_i$. Note that it is not required that the distribution
of $\hat p_i$ be uniform on $(0,1)$ whenever $H_i$ is true. (For example, if $H_i$ corresponds
to testing $\theta_i \le 0$ but the true $\theta_i$ is $< 0$, exact uniformity is too strong. Also, even
if the null hypothesis is simple, the p-value may have a discrete distribution.)

Theorem 9.1.1 (Bonferroni Procedure) If, for $i = 1,\ldots,s$, hypothesis $H_i$ is
rejected when $\hat p_i \le \alpha/s$, then the FWER for the simultaneous testing of $H_1,\ldots,H_s$
satisfies (9.3).

Proof. Suppose hypotheses $H_i$ with $i \in I$ are true and the remainder false, with
$|I|$ denoting the cardinality of $I$. From the Bonferroni inequality it follows that

$$\mathrm{FWER} = P\{\text{reject any } H_i \text{ with } i \in I\} \le \sum_{i \in I} P\{\text{reject } H_i\}
= \sum_{i \in I} P\{\hat p_i \le \alpha/s\} \le \sum_{i \in I} \alpha/s = |I|\alpha/s \le \alpha .$$ ■

While such Bonferroni-based procedures satisfactorily control the FWER, their
ability to detect cases in which $H_i$ is false will typically be very low since $H_i$ is
tested at level $\alpha/s$ which, particularly if $s$ is large, is orders of magnitude smaller
than the conventional $\alpha$ levels.
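
As an illustration of the Bonferroni rule of Theorem 9.1.1, the following minimal Python sketch rejects $H_i$ when its p-value is at most $\alpha/s$. The function name and the example p-values are ours and purely illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where H_i is rejected,
    i.e. where the p-value is at most alpha / s."""
    s = len(p_values)
    cutoff = alpha / s
    return [p <= cutoff for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values for s = 4 hypotheses; the cutoff is 0.05 / 4 = 0.0125.
    print(bonferroni([0.001, 0.015, 0.04, 0.5]))
```

With these hypothetical inputs only the first hypothesis falls below the common cutoff, which shows how conservative the single cutoff $\alpha/s$ can be.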

For this reason procedures are prized for which the levels of the individual 
tests are increased over $\alpha/s$ without an increase in the FWER. It turns out that
such a procedure due to Holm (1979) is available under the present minimal 
assumptions. 

The Holm procedure can conveniently be stated in terms of the p-values
$\hat p_1,\ldots,\hat p_s$ of the $s$ individual tests. Let the ordered p-values be denoted by
$\hat p_{(1)} \le \cdots \le \hat p_{(s)}$, and the associated hypotheses by $H_{(1)},\ldots,H_{(s)}$. Then the
Holm procedure is defined stepwise as follows:

Step 1. If $\hat p_{(1)} \ge \alpha/s$, accept $H_1,\ldots,H_s$ and stop. If $\hat p_{(1)} < \alpha/s$, reject $H_{(1)}$ and
test the remaining $s - 1$ hypotheses at level $\alpha/(s - 1)$.

Step 2. If $\hat p_{(1)} < \alpha/s$ but $\hat p_{(2)} \ge \alpha/(s - 1)$, accept $H_{(2)},\ldots,H_{(s)}$ and stop. If
$\hat p_{(1)} < \alpha/s$ and $\hat p_{(2)} < \alpha/(s - 1)$, reject $H_{(2)}$ in addition to $H_{(1)}$ and test the
remaining $s - 2$ hypotheses at level $\alpha/(s - 2)$.

And so on.
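
The stepwise description of the Holm procedure translates directly into a short loop over the ordered p-values. The sketch below is one possible Python rendering; the function name and the example p-values are illustrative assumptions, not part of the text.

```python
def holm(p_values, alpha=0.05):
    """Holm stepdown procedure.

    Walk through the p-values from smallest to largest; at step k
    (k = 1, 2, ...) compare the k-th smallest p-value with
    alpha / (s - k + 1). Stop at the first p-value that is not below
    its cutoff and accept that hypothesis and all remaining ones.
    Returns a list of booleans, True where H_i is rejected.
    """
    s = len(p_values)
    order = sorted(range(s), key=lambda i: p_values[i])
    reject = [False] * s
    for step, i in enumerate(order):      # step = 0 for the smallest p-value
        if p_values[i] >= alpha / (s - step):
            break
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values; cutoffs are 0.0125, 0.0167, 0.025, 0.05.
    print(holm([0.001, 0.015, 0.04, 0.5]))
```

For these hypothetical p-values Holm rejects the first two hypotheses, whereas a common cutoff of $\alpha/s = 0.0125$ would reject only the first.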


Theorem 9.1.2 The Holm procedure satisfies (9.3).

Proof. Suppose $H_i$ with $i \in I$ is the set of true hypotheses, so $P \in \omega_i$ if and
only if $i \in I$. Let $j$ be the smallest (random) index satisfying

$$\hat p_{(j)} = \min_{i \in I} \hat p_i .$$

Note that $j \le s - |I| + 1$. Now, the Holm procedure commits a false rejection if

$$\hat p_{(1)} \le \alpha/s,\ \hat p_{(2)} \le \alpha/(s - 1),\ \ldots,\ \hat p_{(j)} \le \alpha/(s - j + 1) ,$$

which certainly implies that

$$\min_{i \in I} \hat p_i = \hat p_{(j)} \le \alpha/(s - j + 1) \le \alpha/|I| .$$

Therefore, by the Bonferroni inequality, the probability of a false rejection is
bounded above by

$$P\{\min_{i \in I} \hat p_i \le \alpha/|I|\} \le \sum_{i \in I} P\{\hat p_i \le \alpha/|I|\} \le \alpha .$$ ■

The Bonferroni method is an example of a single-step procedure, meaning any
hypothesis is rejected if its corresponding p-value is less than a common cutoff
value (which in the Bonferroni case is $\alpha/s$). The Holm procedure is a special
case of a class of stepdown procedures, which we now briefly describe. Roughly
speaking, stepdown procedures begin by determining whether the test that looks
most significant should be rejected. If each individual test is summarized by a
p-value, this can be described as follows. Let

$$\alpha_1 \le \alpha_2 \le \cdots \le \alpha_s \tag{9.5}$$

be constants. If $\hat p_{(1)} \ge \alpha_1$, accept all hypotheses. Otherwise, for $r = 1,\ldots,s$,
reject hypotheses $H_{(1)},\ldots,H_{(r)}$ if

$$\hat p_{(1)} < \alpha_1,\ \ldots,\ \hat p_{(r)} < \alpha_r . \tag{9.6}$$

That is, a stepdown procedure starts with the most significant p-value and
continues rejecting hypotheses as long as their corresponding p-values are small. The
Holm procedure uses $\alpha_i = \alpha/(s - i + 1)$. (Alternatively, if the rejection region
of each test corresponds to large values of a test statistic, a stepdown procedure
begins by determining whether or not the hypothesis corresponding to the largest
test statistic should be rejected; see Procedure 9.1.1 below.)

On the other hand, stepup procedures begin by looking at the least significant
p-value (or the smallest value of a test statistic when the individual tests reject for
large values). For a given set of constants (9.5), reject all hypotheses if $\hat p_{(s)} < \alpha_s$.
Otherwise, for $r = s,\ldots,1$, reject hypotheses $H_{(1)},\ldots,H_{(r)}$ if

$$\hat p_{(s)} \ge \alpha_s,\ \ldots,\ \hat p_{(r+1)} \ge \alpha_{r+1} \ \text{ but } \ \hat p_{(r)} < \alpha_r . \tag{9.7}$$
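
The difference between (9.6) and (9.7) is only the direction in which the ordered p-values are scanned. The following Python sketch implements both scans for an arbitrary nondecreasing list of constants as in (9.5). The function names and the example p-values are ours; the example uses the Holm constants purely to contrast the two mechanics and makes no claim about which error rate the stepup variant controls.

```python
def stepdown(p_values, cutoffs):
    """Stepdown rule (9.6): scan from the smallest p-value upward and
    reject while the k-th smallest p-value is below cutoffs[k-1]."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for k, i in enumerate(order):
        if p_values[i] >= cutoffs[k]:
            break
        reject[i] = True
    return reject


def stepup(p_values, cutoffs):
    """Stepup rule (9.7): scan from the largest p-value downward; at the
    first r whose r-th smallest p-value is below cutoffs[r-1], reject
    the hypotheses with the r smallest p-values."""
    s = len(p_values)
    order = sorted(range(s), key=lambda i: p_values[i])
    reject = [False] * s
    for r in range(s, 0, -1):
        if p_values[order[r - 1]] < cutoffs[r - 1]:
            for i in order[:r]:
                reject[i] = True
            break
    return reject


if __name__ == "__main__":
    alpha, p = 0.05, [0.03, 0.04, 0.045, 0.049]   # hypothetical p-values
    s = len(p)
    holm_constants = [alpha / (s - i + 1) for i in range(1, s + 1)]
    print(stepdown(p, holm_constants))
    print(stepup(p, holm_constants))
```

With the same constants, a stepup rule always rejects at least as many hypotheses as the corresponding stepdown rule, since the stepdown rule stops at the first p-value that fails its cutoff.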

Safeguards against false rejections are of course not the only concern of multiple 
testing procedures. Corresponding to the power of a single test one must also 
consider the ability of a multiple test procedure to detect departures from the 
hypotheses when they do occur. For certain parametric models, optimality results
for some stepwise procedures will be developed in the next section. For now, we 
show that it is possible to improve upon the Holm method by incorporating the 
dependence structure of the individual tests. 

To see how, suppose that a test of the individual hypothesis $H_j$ is based on a
test statistic $T_{n,j}$, with large values indicating evidence against $H_j$. (The use of
the subscript $n$ in the test statistics will be for asymptotic purposes later on.)

If $P$ is the true probability distribution generating the data, let $I = I(P) \subset
\{1,\ldots,s\}$ denote the indices of the set of true hypotheses; that is, $i \in I$ if and
only if $P \in \omega_i$. For $K \subset \{1,\ldots,s\}$, let $H_K$ denote the intersection hypothesis that
all $H_i$ with $i \in K$ are true; that is, $H_K$ is equivalent to $P \in \bigcap_{i \in K} \omega_i$. In order
to improve upon the Holm method, the basic idea is to use critical values that
more accurately approximate the distribution of $\max_{j \in K} T_{n,j}$ when testing $H_K$,
at least when $H_K$ is in fact true. Let

$$T_{n,r_1} \ge T_{n,r_2} \ge \cdots \ge T_{n,r_s} \tag{9.8}$$

denote the observed ordered test statistics, and let $H_{(1)}, H_{(2)},\ldots,H_{(s)}$ be the
corresponding hypotheses. A stepdown procedure begins with the most significant
test statistic. First, test the joint null hypothesis $H_{\{1,\ldots,s\}}$ that all hypotheses
are true. This hypothesis is rejected if $T_{n,r_1}$ is large. If it is not large, accept
all hypotheses; otherwise, reject the hypothesis corresponding to the largest test
statistic. Once a hypothesis is rejected, remove it and test the remaining hypotheses
by rejecting for large values of the maximum of the remaining test statistics,
and so on. To be specific, consider the following generic procedure, based on
critical values $c_{n,K}(1 - \alpha)$, where $c_{n,K}(1 - \alpha)$ is designed for testing the intersection
hypothesis $H_K$ at nominal level $\alpha$. Although we are not specifying the constants
at this point, we note that they could be nonrandom or data-dependent.

Procedure 9.1.1 (Generic Stepdown Method)

1. Let $K_1 = \{1,\ldots,s\}$. If $T_{n,r_1} \le c_{n,K_1}(1 - \alpha)$, then accept all hypotheses
and stop; otherwise, reject $H_{(1)}$ and continue.

2. Let $K_2$ be the indices of the hypotheses not previously rejected. If $T_{n,r_2} \le
c_{n,K_2}(1 - \alpha)$, then accept all remaining hypotheses and stop; otherwise,
reject $H_{(2)}$ and continue.

$\vdots$

j. Let $K_j$ be the indices of the hypotheses not previously rejected. If $T_{n,r_j} \le
c_{n,K_j}(1 - \alpha)$, then accept all remaining hypotheses and stop; otherwise,
reject $H_{(j)}$ and continue.

$\vdots$

s. If $T_{n,r_s} \le c_{n,K_s}(1 - \alpha)$, then accept $H_{(s)}$; otherwise, reject $H_{(s)}$.
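
Procedure 9.1.1 can be sketched in a few lines of Python once the critical values are supplied as a function of the index set $K$. In the sketch below, `generic_stepdown` and `bonferroni_z_crit` are hypothetical names introduced here. The particular critical value $z_{1-\alpha/|K|}$ is only an illustrative assumption (one-sided statistics that are standard normal under the null); it is nondecreasing in $|K|$, so it satisfies the monotonicity requirement discussed next.

```python
from statistics import NormalDist


def generic_stepdown(stats, crit):
    """Generic stepdown method in the spirit of Procedure 9.1.1.

    stats : list of test statistics T_{n,j} (large values are evidence
            against H_j).
    crit  : callable mapping a frozenset K of indices to the critical
            value c_{n,K}(1 - alpha) for the intersection hypothesis H_K.
    Returns the set of indices of rejected hypotheses.
    """
    remaining = set(range(len(stats)))
    rejected = set()
    while remaining:
        j = max(remaining, key=lambda i: stats[i])   # most significant left
        if stats[j] <= crit(frozenset(remaining)):
            break                                    # accept all remaining
        rejected.add(j)
        remaining.remove(j)
    return rejected


def bonferroni_z_crit(alpha=0.05):
    """Illustrative critical values: the 1 - alpha/|K| standard normal
    quantile (an assumption made for this example only)."""
    nd = NormalDist()
    return lambda K: nd.inv_cdf(1.0 - alpha / len(K))


if __name__ == "__main__":
    # Hypothetical z-statistics for s = 3 hypotheses.
    print(generic_stepdown([3.0, 2.3, 1.0], bonferroni_z_crit(0.05)))
```

Passing the critical values as a callable keeps the loop independent of how they are obtained, whether from a closed form, a table, or simulation.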

The problem now is how to construct the $c_{n,K}(1 - \alpha)$ so that the FWER is
controlled. The following result reduces the multiple testing problem of controlling
the FWER to that of constructing single tests that control the probability
of a Type 1 error.

Theorem 9.1.3 Let $P$ denote the true distribution generating the data. Consider
Procedure 9.1.1 based on critical values $c_{n,K}(1 - \alpha)$ which satisfy the monotonicity
requirement: for any $K \supset I(P)$,

$$c_{n,K}(1 - \alpha) \ge c_{n,I(P)}(1 - \alpha) . \tag{9.9}$$

(i) Then,

$$\mathrm{FWER}_P \le P\{\max(T_{n,j} : j \in I(P)) > c_{n,I(P)}(1 - \alpha)\} . \tag{9.10}$$

(ii) Also suppose that if $c_{n,K}(1 - \alpha)$ is used to test the intersection hypothesis
$H_K$, then it is level $\alpha$ when $K = I(P)$; that is,

$$P\{\max(T_{n,j} : j \in I(P)) > c_{n,I(P)}(1 - \alpha)\} \le \alpha . \tag{9.11}$$

Then $\mathrm{FWER}_P \le \alpha$.


Proof. Consider the event that a true hypothesis is rejected, so that for some
$i \in I(P)$, hypothesis $H_i$ is rejected. Let $\hat j$ be the smallest index $j$ in the method
where this occurs, so that

$$\max\{T_{n,j} : j \in I(P)\} > c_{n,K_{\hat j}}(1 - \alpha) . \tag{9.12}$$

Since $K_{\hat j} \supset I(P)$, assumption (9.9) implies

$$c_{n,K_{\hat j}}(1 - \alpha) \ge c_{n,I(P)}(1 - \alpha) \tag{9.13}$$

and so (i) follows. Part (ii) follows immediately from (i). ■


Example 9.1.4 (Multivariate Normal Mean) Suppose $(X_1,\ldots,X_s)$ is
multivariate normal with unknown mean $\mu = (\mu_1,\ldots,\mu_s)$ and known covariance
matrix $\Sigma$ having $(i,j)$ component $\sigma_{i,j}$. Consider testing $H_j : \mu_j \le 0$ versus
$\mu_j > 0$. Let $T_{n,j} = X_j/\sqrt{\sigma_{j,j}}$, since the test that rejects for large $X_j/\sqrt{\sigma_{j,j}}$ is
UMP for testing $H_j$. To apply Theorem 9.1.3, let $c_{n,K}(1 - \alpha)$ be the $1 - \alpha$ quantile
of the distribution of $\max(T_{n,j} : j \in K)$ when $\mu = 0$. Since

$$\max(T_{n,j} : j \in I) \le \max(T_{n,j} : j \in K)$$

whenever $I \subset K$, the monotonicity requirement (9.9) is satisfied. Moreover, the
resulting test procedure rejects at least as many hypotheses as the Holm
procedure (Problem 9.5). In the special case when $\sigma_{i,i} = \sigma^2$ is independent of $i$ and $\sigma_{i,j}$
has the product structure $\sigma_{i,j} = \lambda_i\lambda_j$, then Appendix 3 (p.374) of Hochberg and
Tamhane (1987) reduces the problem of determining the distribution of the
maximum of a multivariate normal vector to a univariate integral. In general, one can
resort to simulation to approximate the critical values; see Example 11.2.13. ■
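
To make the simulation remark concrete, here is a minimal Monte Carlo sketch in Python that approximates the $1-\alpha$ quantile of $\max(T_{n,j} : j \in K)$ under $\mu = 0$. It assumes, purely for illustration, unit variances and a common correlation $\rho \ge 0$, generated through a shared factor. The function names, the seed and the number of draws are our choices, and this is not the method of Example 11.2.13.

```python
import math
import random


def simulate_null_draws(s, rho, n_sim, seed=0):
    """Draw n_sim vectors of s equicorrelated N(0,1) variables using
    X_j = sqrt(rho) * Z_0 + sqrt(1 - rho) * Z_j (requires 0 <= rho <= 1)."""
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    draws = []
    for _ in range(n_sim):
        z0 = rng.gauss(0.0, 1.0)
        draws.append([a * z0 + b * rng.gauss(0.0, 1.0) for _ in range(s)])
    return draws


def critical_value(draws, K, alpha=0.05):
    """Empirical 1 - alpha quantile of max(T_j : j in K) over the draws."""
    maxima = sorted(max(x[j] for j in K) for x in draws)
    idx = min(len(maxima) - 1, math.ceil((1.0 - alpha) * len(maxima)) - 1)
    return maxima[idx]


if __name__ == "__main__":
    draws = simulate_null_draws(s=4, rho=0.5, n_sim=4000, seed=1)
    for K in ([0], [0, 1], [0, 1, 2, 3]):
        print(K, critical_value(draws, K))
```

Because the same simulated draws are reused for every $K$, the estimated critical values are automatically nondecreasing as $K$ grows, mirroring the monotonicity requirement (9.9).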


Example 9.1.5 (One-way Layout) Suppose for $i = 1,\ldots,s$ and $j =
1,\ldots,n_i$, $X_{i,j} = \mu_i + \epsilon_{i,j}$, where the $\epsilon_{i,j}$ are i.i.d. $N(0,\sigma^2)$; the vector $\mu =
(\mu_1,\ldots,\mu_s)$ and $\sigma^2$ are unknown. Consider testing $H_i : \mu_i = 0$ against $\mu_i \ne 0$.
Let $t_{n,i} = n_i^{1/2}\bar X_{i\cdot}/S$, where

$$\bar X_{i\cdot} = n_i^{-1}\sum_{j=1}^{n_i} X_{i,j} , \qquad S^2 = \sum_{i=1}^{s}\sum_{j=1}^{n_i} (X_{i,j} - \bar X_{i\cdot})^2/\nu ,$$

and $\nu = \sum_i (n_i - 1)$. Under $H_i$, $t_{n,i}$ has a $t$-distribution with $\nu$ degrees of freedom.

Let $T_{n,i} = |t_{n,i}|$, and let $c_{n,K}(1 - \alpha)$ denote the $1 - \alpha$ quantile of the distribution
of $\max(T_{n,i} : i \in K)$ when $\mu = 0$ and $\sigma = 1$. Since

$$\max(T_{n,i} : i \in I) \le \max(T_{n,i} : i \in K)$$

whenever $I \subset K$, the monotonicity requirement (9.9) follows. Note that the joint distribution of
$(t_{n,1},\ldots,t_{n,s})$ follows an $s$-variate multivariate $t$-distribution with $\nu$ degrees of
freedom; see Hochberg and Tamhane (1987, p.374-5). ■

When the number of tests is in the tens or hundreds of thousands, control of
the FWER at conventional levels becomes so stringent that individual departures
from the hypothesis have little chance of being detected, and it is unreasonable
to control the probability of even one false rejection. A radical weakening of the
FWER was proposed by Benjamini and Hochberg (1995), who suggested the
following. For a given multiple testing decision rule, let $N$ be the total number
of rejections and let $F$ be the number of false rejections, i.e., the number of
rejections among the $N$ rejections corresponding to true null hypotheses. Define
$Q$ to be $F/N$ (and defined to be 0 if $N = 0$). Thus $Q$ is the proportion of
rejected hypotheses that are rejected erroneously. When none of the hypotheses
are rejected, both numerator and denominator of that proportion are 0, and $Q$
is then defined to be 0. The false discovery rate (FDR) is

$$\mathrm{FDR} = E(Q). \tag{9.14}$$

When all hypotheses are true, FDR = FWER. In general, FDR $\le$ FWER
(Problem 9.9), and typically this inequality is strict, so that the FDR is more
liberal (in the sense of permitting more rejections) than the FWER. The FDR is
a fairly recent idea, and its properties and behavior are the subject of very active
research. We shall here only mention some recent papers on this topic: Finner
and Roters (2001), Benjamini and Yekutieli (2001) and Sarkar (2002).
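
The quantity $Q$ is easy to compute for a single realized set of decisions, and doing so makes the comparison with the FWER transparent: $Q$ never exceeds the indicator that at least one false rejection occurred, so taking expectations gives FDR $\le$ FWER. A small Python sketch, with hypothetical function names and index sets chosen for illustration:

```python
def false_discovery_proportion(rejected, true_nulls):
    """Q = F / N, where N is the number of rejections and F the number of
    rejected hypotheses that are in fact true; Q = 0 when N = 0."""
    rejected, true_nulls = set(rejected), set(true_nulls)
    n = len(rejected)
    if n == 0:
        return 0.0
    f = len(rejected & true_nulls)
    return f / n


def any_false_rejection(rejected, true_nulls):
    """Indicator (0 or 1) of at least one false rejection; its expectation
    over repeated experiments is the FWER."""
    return 1 if set(rejected) & set(true_nulls) else 0


if __name__ == "__main__":
    # Hypothetical outcome: hypotheses 0-3 rejected, hypotheses 3 and 4 true.
    rejected, true_nulls = {0, 1, 2, 3}, {3, 4}
    print(false_discovery_proportion(rejected, true_nulls))
    print(any_false_rejection(rejected, true_nulls))
```

When every hypothesis is true, any rejection is false, so $Q$ coincides with the indicator and the two error rates agree.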


9.2 Maximin Procedures 

In the present section we shall obtain optimal procedures for a class of problems 
of the kind illustrated in Examples 9.1.1 and 9.1.2. 

Consider the general problem of testing simultaneously $s$ hypotheses $H_i : \theta_i \le 0$
against the alternatives $\theta_i > 0$ $(i = 1,\ldots,s)$, and suppose that we would reject
the individual hypotheses $H_i$ if a test statistic $T_i$ were sufficiently large. The joint
c.d.f. of $(T_1,\ldots,T_s)$ will be denoted by $F_\theta$, $\theta = (\theta_1,\ldots,\theta_s)$, and we shall assume
that the marginal distribution of $T_i$ depends only on $\theta_i$. The parameter and
sample space will be assumed to be finite or infinite open rectangles $\underline\theta_i < \theta_i < \bar\theta_i$
and $\underline t_i < t_i < \bar t_i$ respectively. For ease of notation we shall suppose that

$$\underline\theta_i = \underline t_i = -\infty \ \text{ and } \ \bar\theta_i = \bar t_i = \infty \ \text{ for all } i .$$

We shall assume further that, for any $B$,

$$P_{\theta_i}\{T_i \le B\} \to 1 \text{ as } \theta_i \to -\infty \ \text{ and } \ P_{\theta_i}\{T_i \ge B\} \to 1 \text{ as } \theta_i \to +\infty .$$

A crucial assumption will be that the distributions $F_\theta$ are stochastically
increasing in the following sense, which generalizes the univariate definition in
Section 3.4 to $s$ dimensions. A set $\omega$ in $\mathbb{R}^s$ is said to be monotone increasing if

$$t = (t_1,\ldots,t_s) \in \omega \text{ and } t_i \le t_i' \text{ for all } i \text{ implies } t' \in \omega ,$$

and the distributions $F_\theta$ will be called stochastically increasing if $\theta_i \le \theta_i'$ for all
$i$ implies

$$\int_\omega dF_\theta \le \int_\omega dF_{\theta'} \tag{9.15}$$

for every monotone increasing set $\omega$.

The condition will be assumed not only for the distributions of $(T_1,\ldots,T_s)$
but also for $(\pm T_1,\ldots,\pm T_s)$. Thus, for example, for $(-T_1,\ldots,-T_s)$ it means that
for any decreasing region the inequality (9.15) will be reversed. A class of models
for which (9.15) holds is given in Problem 9.10.

For the sake of simplicity, we shall suppose that when $\theta_1 = \cdots = \theta_s$, the
variables $(T_1,\ldots,T_s)$ are exchangeable, i.e., that the joint distribution is invariant
under permutations of the components. In addition, we assume that the joint
distribution of $(T_1,\ldots,T_s)$ has a density with respect to Lebesgue measure.² In
order for the critical constants to be uniquely defined, we further assume that
the joint density is positive on its (assumed rectangular) region of support, but
this can be weakened.

Under these assumptions we shall restrict attention to decision rules satisfying
the following monotonicity condition. A decision procedure $E$ for the simultaneous
testing of $H_1,\ldots,H_s$ based on $T = (T_1,\ldots,T_s)$ states for each possible
observation vector $t$ the subset $I_t$ of $\{1,\ldots,s\}$ of values $i$ for which the hypothesis
$H_i$ is rejected. A decision rule $E$ is said to be monotone increasing if $t_i \le t_i'$ for
$i \in I_t$ and $t_i' \le t_i$ for $i \notin I_t$ implies that $I_t = I_{t'}$.

The ordered $T$-values will be denoted by $T_{(1)} \le T_{(2)} \le \cdots \le T_{(s)}$ and the
corresponding hypotheses by $H_{(1)},\ldots,H_{(s)}$. Consider the following monotone
decision procedure $D$, which can be viewed as an application of Procedure 9.1.1.


The Stepdown Procedure D:

Step 1. If $T_{(s)} < C_1$, accept $H_1,\ldots,H_s$. If $T_{(s)} \ge C_1$ but $T_{(s-1)} < C_2$, reject $H_{(s)}$
and accept $H_{(1)},\ldots,H_{(s-1)}$.

Step 2. If $T_{(s)} \ge C_1$ and $T_{(s-1)} \ge C_2$, but $T_{(s-2)} < C_3$, reject $H_{(s)}$ and $H_{(s-1)}$
and accept $H_{(1)},\ldots,H_{(s-2)}$.

And so on. The $C$’s are determined by

$$P_{\underbrace{0,\ldots,0}_{j}}\{\max(T_1,\ldots,T_j) \ge C_{s-j+1}\} = \alpha , \tag{9.16}$$

and therefore the $C$’s are nonincreasing.
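
For a concrete instance of (9.16), assume (this is our illustrative assumption, not a requirement of the text) that the $T_i$ are independent with $T_i \sim N(\theta_i, 1)$. Then $P_{0,\ldots,0}\{\max(T_1,\ldots,T_j) \ge C\} = 1 - \Phi(C)^j$, so $C_{s-j+1} = \Phi^{-1}\big((1-\alpha)^{1/j}\big)$. The Python sketch below computes these constants and applies the stepdown procedure $D$; the function names are hypothetical.

```python
from statistics import NormalDist


def stepdown_constants(s, alpha=0.05):
    """Constants C_1 >= ... >= C_s solving (9.16) when T_1, ..., T_s are
    independent N(0, 1) under the null (an illustrative assumption).
    C_k uses j = s - k + 1 statistics: 1 - Phi(C_k)**j = alpha."""
    nd = NormalDist()
    return [nd.inv_cdf((1.0 - alpha) ** (1.0 / (s - k + 1)))
            for k in range(1, s + 1)]


def procedure_D(t, C):
    """Stepdown procedure D: compare the largest statistic with C_1, the
    next largest with C_2, and so on; stop at the first statistic below
    its constant. Returns rejected indices in order of rejection."""
    order = sorted(range(len(t)), key=lambda i: t[i], reverse=True)
    rejected = []
    for k, i in enumerate(order):
        if t[i] < C[k]:
            break
        rejected.append(i)
    return rejected


if __name__ == "__main__":
    C = stepdown_constants(2, alpha=0.05)
    print(C)
    print(procedure_D([2.5, 1.7], C))
    print(procedure_D([1.8, 1.7], C))
```

In the last call both statistics exceed $C_2$ but neither reaches $C_1$, so nothing is rejected: a stepdown procedure cannot proceed past a failed first step.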


Lemma 9.2.1 Under the above assumptions, the procedure D with critical 
constants given by (9.16) controls the FWER in the strong sense. 


²This assumption is used only so that the critical constants of the optimal procedures
lead to control at exact level $\alpha$.

Proof. Apply Theorem 9.1.3 with $c_{n,K}(1 - \alpha) = C_{s-|K|+1}$, where $|K|$ is the
cardinality of $K$. Then, by the monotonicity of the $C$’s, condition (9.9) holds. We
must verify (9.11) for every $P_\theta$. Suppose $\theta$ is such that exactly $p$ hypotheses are
true. By exchangeability, we can assume $H_1,\ldots,H_p$ are true and $H_{p+1},\ldots,H_s$
are false. A false rejection occurs if and only if at least one of $H_1,\ldots,H_p$ is
rejected. Since $D$ is monotone, the probability of this event is largest when

$$\theta_1 = \cdots = \theta_p = 0 \ \text{ and } \ \theta_{p+1} \to \infty,\ \ldots,\ \theta_s \to \infty ,$$

and, by (9.16), the sup of this probability is equal to

$$P_{\underbrace{0,\ldots,0}_{p}}\{T_i \ge C_{s-p+1} \text{ for some } i = 1,\ldots,p\} = \alpha .$$ ■

The procedure D defined above is an example of a stepdown procedure in that 
it starts with the most significant (or, in this case, the largest) test statistic and 
continues rejecting hypotheses as long as their corresponding test statistics are 
large. In contrast, stepup procedures begin with the least significant test statistic. 
Consider the following monotone stepup procedure U. 

The Stepup Procedure U:

Step 1. If $T_{(1)} \ge C_1^*$, reject $H_1,\ldots,H_s$. If $T_{(1)} < C_1^*$ but $T_{(2)} \ge C_2^*$, accept $H_{(1)}$
and reject $H_{(2)},\ldots,H_{(s)}$.

Step 2. If $T_{(1)} < C_1^*$ and $T_{(2)} < C_2^*$ but $T_{(3)} \ge C_3^*$, accept $H_{(1)}$ and $H_{(2)}$ and
reject $H_{(3)},\ldots,H_{(s)}$.

And so on. The $C^*$’s are determined by

$$P_{\underbrace{0,\ldots,0}_{j}}\{L_j\} = 1 - \alpha , \tag{9.17}$$

where

$$L_j = \{T_{\pi(1)} < C_1^*,\ \ldots,\ T_{\pi(j)} < C_j^* \text{ for some permutation } \pi \text{ of } \{1,\ldots,j\}\} .$$

The following lemma proves control of the FWER and is left as an exercise 
(Problem 9.11). 

Lemma 9.2.2 Under the above assumptions, the stepup procedure U with critical 
constants given by (9.17) controls the FWER in the strong sense. 

Subject to controlling the FWER we want to maximize what corresponds to
the power of a single test, i.e., the probability of rejecting hypotheses that are in
fact false. Let

$$\beta_i(\theta) = P_\theta\{\text{reject at least } i \text{ hypotheses}\}$$

and, for any $\epsilon > 0$, let $A_i(\epsilon)$ denote the set in the parameter space for which at
least $i$ of the $\theta$’s are $\ge \epsilon$. Then we shall be interested in maximizing

$$\inf_{\theta \in A_i(\epsilon)} \beta_i(\theta) \quad \text{for } i = 1, 2,\ldots,s. \tag{9.18}$$

This is in the same spirit as the maximin criterion of Chapter 8. However, it is
the false hypotheses we should like to reject, and so we also consider maximizing

$$\inf_{\theta \in A_i(\epsilon)} P_\theta\{\text{reject at least } i \text{ false hypotheses}\} . \tag{9.19}$$
SeAi(e) 

We note the following obvious fact.

Lemma 9.2.3 Under (9.15), for any monotone increasing procedure $E$, the
functions $\beta_i(\theta_1,\ldots,\theta_s)$ are nondecreasing in each of the variables $\theta_1,\ldots,\theta_s$.

For the sake of simplicity we shall now consider the maximin problem first
for the case $s = 2$. Corresponding to any decision rule $E$, let $e_{0,0}$ denote the
part of the sample space where both hypotheses are accepted, $e_{0,1}$ where $H_1$ is
accepted and $H_2$ is rejected, $e_{1,0}$ where $H_1$ is rejected and $H_2$ is accepted, and
$e_{1,1}$ where both $H_1$ and $H_2$ are rejected. The following is an optimality result
for the stepdown procedure $D$. It will be convenient in the following theorem to
restate the procedure $D$ in the case $s = 2$.

Theorem 9.2.1 Assume the conditions described at the beginning of this section.

(i) A monotone increasing decision procedure with FWER $\le \alpha$ will maximize
(9.18) for $i = 1$ if and only if it rejects at least one hypothesis when

$$\max(T_1,T_2) \ge C_1 , \tag{9.20}$$

in which case $H_i$ is rejected if $T_i \ge C_1$; in the contrary case, both hypotheses are
accepted. The constant $C_1$ is determined by

$$P_{0,0}\{\max(T_1,T_2) \ge C_1\} = \alpha . \tag{9.21}$$

The minimum value of $\beta_1(\theta)$ over $A_1(\epsilon)$ is $P_\epsilon\{T_i \ge C_1\}$.

(ii) A monotone increasing decision rule with FWER $\le \alpha$ and satisfying (9.20)
will maximize (9.18) for $i = 2$ if and only if it takes the following decisions:

$d_{0,0}$: accept $H_1$ and $H_2$ when $\max(T_1,T_2) < C_1$

$d_{1,0}$: reject $H_1$ and accept $H_2$ when $T_1 \ge C_1$ and $T_2 < C_2$

$d_{0,1}$: accept $H_1$ and reject $H_2$ when $T_1 < C_2$ and $T_2 \ge C_1$

$d_{1,1}$: reject both $H_1$ and $H_2$ when both $T_1$ and $T_2$ are $\ge C_2$ (and when (9.20) holds).

Here $C_2$ is determined by

$$P_0\{T_i \ge C_2\} = \alpha , \tag{9.22}$$

and hence $C_2 < C_1$.

The minimum probability over $A_2(\epsilon)$ of rejecting both hypotheses is

$$P_{\epsilon,\epsilon}\{\text{at least one } T_i \text{ is} \ge C_1 \text{ and both are} \ge C_2\} .$$

(iii) The result (i) holds if the criterion (9.18) is replaced by (9.19) with $i = 1$,
and $P_\epsilon\{T_i \ge C_1\}$ is also the maximum value of criterion (9.19).

Proof. To prove (i), note that the claimed optimal solution has minimum power
when $\theta = (\epsilon, -\infty)$ and $D$ has $P_\epsilon\{T_1 \ge C_1\}$ for the claimed optimal value of
$\beta_1(\theta)$. Now, suppose that $E$ is any other monotone decision rule with FWER
$\le \alpha$. Assume there exists $(t_1,t_2) \notin d_{0,0}$, i.e., rejecting at least one hypothesis,
but $(t_1,t_2) \in e_{0,0}$. Then, there exists at least one component of $(t_1,t_2)$ that is
$\ge C_1$, say $t_1 \ge C_1$. It follows that

$$P_{\epsilon,-\infty}\{e_{0,0}\} \ge P_{\epsilon,-\infty}\{T_1 < t_1, T_2 < t_2\} = P_\epsilon\{T_1 < t_1\} > P_\epsilon\{T_1 < C_1\}$$

and hence

$$P_{\epsilon,-\infty}\{e_{0,0}^c\} < P_{\epsilon,-\infty}\{T_1 \ge C_1\} = P_\epsilon\{T_1 \ge C_1\} .$$

Thus, $E$ has a smaller value of criterion (9.18) than does the claimed optimal
$D$. Therefore, $e_{0,0}$ cannot have points outside of $d_{0,0}$, i.e., $e_{0,0}$ must be a proper
subset of $d_{0,0}$. But then, since both procedures are monotone, $e_{0,0}^c$ is bigger than
$d_{0,0}^c$ on a set of positive Lebesgue measure and so

$$P_{0,0}\{e_{0,0}^c\} > P_{0,0}\{d_{0,0}^c\} = \alpha .$$

It follows that for the maximin procedure, the region $d_{0,0}$ must be given by (9.20).

To prove (ii), the goal now is to show that, among all monotone nondecreasing
procedures which control the FWER and satisfy (9.20), $D$ maximizes

$$\inf_{A_2(\epsilon)} \beta_2(\theta) = \inf_{A_2(\epsilon)} P_\theta\{d_{1,1}\} .$$

To prove this, consider any other monotone procedure $E$ which controls the
FWER and satisfies $e_{0,0} = d_{0,0}$, and suppose that $e_{1,1}$ contains a point $(t_1,t_2)$
with $t_i < C_2$ for some $i$, say $t_1 < C_2$. Then, since $E$ is monotone, $e_{1,1}$ contains the
quadrant $\{T_1 \ge t_1, T_2 \ge t_2\}$, and hence

$$P_{0,\infty}\{e_{1,1}\} \ge P_{0,\infty}\{T_1 \ge t_1, T_2 \ge t_2\} = P_0\{T_1 \ge t_1\} > P_0\{T_1 \ge C_2\} = \alpha ,$$

which contradicts strong control. It follows that $e_{1,1}$ is a proper subset of $d_{1,1}$,
and

$$P_\theta\{e_{1,1}\} < P_\theta\{d_{1,1}\} \quad \text{for all } \theta .$$

Since the inf over $A_2(\epsilon)$ of both sides is attained at $(\epsilon,\epsilon)$,

$$\inf_{A_2(\epsilon)} P_\theta\{e_{1,1}\} < \inf_{A_2(\epsilon)} P_\theta\{d_{1,1}\} ,$$

as was to be proved.

To prove (iii), observe that, for any $\theta$,

$$P_\theta\{\text{rejecting at least one false } H_i\} \le P_\theta\{\text{rejecting at least one } H_i\} ,$$

and so

$$\inf_{\theta \in A_1(\epsilon)} P_\theta\{\text{rejecting at least one false } H_i\} \le \inf_{\theta \in A_1(\epsilon)} P_\theta\{\text{rejecting at least one } H_i\} .$$

But, the right side is $P_\epsilon\{T_i \ge C_1\}$, and so it suffices to show that $D$ satisfies

$$\inf_{\theta \in A_1(\epsilon)} P_\theta\{D \text{ rejects at least one false } H_i\} = P_\epsilon\{T_i \ge C_1\} .$$

But, this last result is easily checked.

Finally, once $d_{0,0}$ and $d_{1,1}$ are determined, so are $d_{0,1}$ and $d_{1,0}$ by monotonicity,
and this completes the proof. ■

Theorem 9.2.1 provides the maximin test which first maximizes $\inf \beta_1(\theta)$ and
then $\inf \beta_2(\theta)$. In the next result, the order in which these aspects are maximized
is reversed, which results in the stepup procedure $U$ being optimal.

Theorem 9.2.2 Assume the conditions described at the beginning of this section.
(i) A monotone decision rule with FWER $\le \alpha$ will maximize (9.18) for $i = 2$ if
and only if it rejects both hypotheses, i.e., takes decision $u_{1,1}$, when

$$\min(T_1, T_2) \ge C_1^* \tag{9.23}$$

and accepts $H_i$ if $T_i < C_1^*$, where $C_1^* = C_2$ is determined by (9.22). Its minimum
power $\beta_2(\theta)$ over $A_2(\epsilon)$ is

$$P_\epsilon\{\min(T_1, T_2) \ge C_1^*\} . \tag{9.24}$$

(ii) The monotone procedure with FWER $\le \alpha$ and satisfying (9.23) maximizes
(9.18) for $i = 1$ if and only if it takes the following decisions:

$u_{0,1} = \{T_1 < C_1^*,\ T_2 \ge C_2^*\}$

$u_{1,0} = \{T_1 \ge C_2^*,\ T_2 < C_1^*\}$

$u_{0,0} = \{T_1 < C_1^*,\ T_2 < C_2^*\} \cup \{T_1 < C_2^*,\ T_2 < C_1^*\}$,

where $C_2^*$ is determined by

$$P_{0,0}\{u_{0,0}^c\} = \alpha . \tag{9.25}$$

Its minimum power $\beta_1(\theta)$ over $A_1(\epsilon)$ is

$$P_\epsilon\{T_i \ge C_2^*\} . \tag{9.26}$$

(iii) The result (ii) holds if criterion (9.18) with $i = 1$ is replaced by (9.19) with
$i = 1$.

Note that

$$C_1^* = C_2 < C_1 < C_2^* . \tag{9.27}$$

Also, the best minimum power $\beta_1(\theta)$ over $A_1(\epsilon)$ for the procedure of Theorem
9.2.1 exceeds that for Theorem 9.2.2, while the situation is reversed for the best
minimum power of $\beta_2(\theta)$ over $A_2(\epsilon)$. This is, of course, as it must be since the
first of these two procedures maximized the minimum value of $\beta_1(\theta)$ over $A_1(\epsilon)$
while the second maximized the minimum value of $\beta_2(\theta)$ over $A_2(\epsilon)$.

Proof. (i) Suppose that $E$ is any other monotone procedure with FWER $\le \alpha$.
Assume there exists $(t_1, t_2) \in e_{1,1}$ such that $t_i < C_1^*$ for some $i$, say $t_1 < C_1^*$.
Then,

$$P_{0,\infty}\{e_{1,1}\} \ge P_{0,\infty}\{T_1 \ge t_1,\ T_2 \ge t_2\} = P_0\{T_1 \ge t_1\} > P_0\{T_1 \ge C_1^*\} = \alpha ,$$

which would violate the FWER condition. Therefore, $e_{1,1} \subset u_{1,1}$. But then

$$\inf_{A_2(\epsilon)} \beta_2(\theta)$$

is smaller for $E$ than for $U$, as was to be proved.

(ii) Note that, for the claimed solution, $\inf_{A_1(\epsilon)} \beta_1(\theta)$ is given by

$$\inf_{\theta \in A_1(\epsilon)} P_\theta\{u_{0,0}^c\} = P_{\epsilon,-\infty}\{u_{0,0}^c\} = P_\epsilon\{T_1 \ge C_2^*\} .$$

We now seek to determine $u_{0,0}$, as in Theorem 9.2.1, but with the added constraint
that $u_{0,0} \subset u_{1,1}^c$.

To prove optimality for the claimed solution, suppose that $E$ is another monotone
procedure controlling FWER at $\alpha$, and satisfying $e_{1,1} = u_{1,1}$ with $u_{1,1}$ given
by (9.23). Assume $(t_1, t_2) \in e_{0,0}$ but $\notin u_{0,0}$, so that $t_i \ge C_2^*$ for some $i$, say
$i = 1$. Then,

$$P_{\epsilon,-\infty}\{e_{0,0}\} \ge P_{\epsilon,-\infty}\{T_1 < t_1,\ T_2 < t_2\} = P_\epsilon\{T_1 < t_1\} > P_\epsilon\{T_1 < C_2^*\} .$$

Hence,

$$P_{\epsilon,-\infty}\{e_{0,0}^c\} < P_\epsilon\{T_1 \ge C_2^*\} ,$$

so that $E$ cannot be optimal. It follows that $e_{0,0} \subset u_{0,0}$. But if $e_{0,0}$ is a proper
subset of $u_{0,0}$, the set $e_{0,0}^c$ in which $E$ rejects at least one hypothesis contains
$u_{0,0}^c$ and so

$$P_{0,0}\{e_{0,0}^c\} > P_{0,0}\{u_{0,0}^c\} = \alpha ,$$

and $E$ does not control the FWER at $\alpha$.

Finally, the proof of (iii) is analogous to the proof of (iii) in Theorem 9.2.1. ■

Theorems 9.2.1 and 9.2.2 have natural extensions to the case of s hypotheses 
where the aim is to maximize the s quantities (9.18). As in the case s = 2, these 
maximizations lead to different procedures, and one must choose their order of 
importance. The two most natural choices are the following: 

(a) Begin by maximizing $\inf \beta_1(\theta)$, which will lead to an optimal choice for
$d_{0,0,\ldots,0}$, the decision to accept all hypotheses. With $d_{0,\ldots,0}$ fixed, the partition
of $d_{0,\ldots,0}^c$ into the subsets in which the remaining decisions should be
taken is begun by maximizing the minimum of $\beta_2(\theta)$ over the part of the
parameter space in which at least 2 hypotheses are false, and so on.

(b) Alternatively, we may start at the other end by maximizing $\inf \beta_s(\theta)$, and
from there proceed downward.

We shall here only state the result for case (a). For its proof and the statement 
and proof for case (b), see Lehmann, Romano, and Shaffer (2003). 

Theorem 9.2.3 Under the assumptions made at the beginning of this section,
among all monotone procedures $E$ with FWER $\le \alpha$, the stepdown procedure $D$
with critical constants given by (9.16) has the following properties:

(i) it maximizes $\inf \beta_1(\theta)$ over $A_1(\epsilon)$;

(ii) it maximizes $\inf \beta_2(\theta)$ over $A_2(\epsilon)$ subject to the additional condition $e_{s,2} \subset
d_{s,1}$, where $e_{s,i}$ and $d_{s,i}$ denote the events that the procedures $E$ and $D$ reject at
least $i$ of the hypotheses $H_1, \ldots, H_s$.

(iii) Quite generally, it maximizes both (9.18) and (9.19) among all monotone
procedures $E$ with FWER $\le \alpha$ and satisfying $e_{s,i} \subset d_{s,i-1}$.

We shall now provide a canonical form for certain stepdown procedures, 
and particularly for the maximin procedure D of Theorem 9.2.3, that provides 
additional insights. 

Let $\hat p_1, \ldots, \hat p_s$ be the p-values of the statistics $T_1, \ldots, T_s$, and denote the ordered
p-values by $\hat p_{(1)} \le \cdots \le \hat p_{(s)}$. If $F$ denotes the common marginal distribution of
$T_i$ under $\theta_i = 0$, we have that

$$\hat p_i = 1 - F(T_i) \tag{9.28}$$

and hence that

$$\hat p_{(1)} = 1 - F(T_{(s)}) . \tag{9.29}$$

In terms of the $\hat p$'s, the steps of the stepdown procedure

$$T_{(s)} \ge C_1, \quad T_{(s-1)} \ge C_2, \ \ldots \tag{9.30}$$

are equivalent respectively to

$$\hat p_{(1)} \le \alpha_1, \quad \hat p_{(2)} \le \alpha_2, \ \ldots \tag{9.31}$$

for suitable $\alpha$'s. In particular, $T_{(s)} \ge C_1$ is equivalent to $\hat p_{(1)} \le \alpha_1$. Thus, by
(9.29), $T_{(s)} < C_1$ is equivalent to $F(T_{(s)}) < 1 - \alpha_1$, so that

$$C_1 = F^{-1}(1 - \alpha_1) .$$

On the other hand, if $G_s$ denotes the distribution of $T_{(s)}$ when all the $\theta_i$ are 0, it
follows from (9.16) that $C_1 = G_s^{-1}(1 - \alpha)$ and hence that

$$1 - \alpha_1 = F[G_s^{-1}(1 - \alpha)] , \tag{9.32}$$

which gives $\alpha_1$ as a function of $\alpha$.

It is of interest to determine the ranges of the step levels $\alpha_1, \ldots, \alpha_s$. Since
$G_s(t) \le F(t)$ for all $t$, it follows from (9.32) that $1 - \alpha_1 \ge 1 - \alpha$ for all $F$, or

$$\alpha_1 \le \alpha \quad \text{for all } F , \tag{9.33}$$

with equality when $F = G_s$, i.e., when $T_1 = \cdots = T_s$. To find a lower bound for $\alpha_1$,
put $u = G_s^{-1}(1 - \alpha)$ in (9.32) so that

$$1 - \alpha_1 = F(u) \quad \text{with} \quad 1 - \alpha = G_s(u) \tag{9.34}$$

and note that for all $u$

$$1 - G_s(u) = P\{\text{at least one } T_i > u\} \le \sum_{i=1}^{s} P\{T_i > u\} = s[1 - F(u)] .$$

Thus,

$$F(u) \le 1 - \frac{1}{s}[1 - G_s(u)] = 1 - \frac{\alpha}{s}$$

and hence

$$\alpha_1 \ge \frac{\alpha}{s} . \tag{9.35}$$

We shall now show that the lower bound (9.35) is sharp by giving an example
of a joint distribution of $(T_1, \ldots, T_s)$ for which it is attained.


Example 9.2.1 (A Least Favorable Distribution) Let $U$ be uniformly distributed
on (0,1) and suppose that when $H_1, \ldots, H_s$ are all true,

$$Y_1 = U, \quad Y_2 = U + \frac{1}{s} \ (\mathrm{mod}\ 1), \ \ldots, \ Y_s = U + \frac{s-1}{s} \ (\mathrm{mod}\ 1) .$$

Since $(Y_1, \ldots, Y_s)$ does not satisfy our assumption of exchangeability, replace
it by the exchangeable set of variables $(X_1, \ldots, X_s) = (Y_{\pi(1)}, \ldots, Y_{\pi(s)})$, where
$(\pi(1), \ldots, \pi(s))$ is a random permutation of $(1, \ldots, s)$ (and independent of $U$).
Let $T_i = 1 - X_i$ and suppose that $H_i$ is rejected when $T_i$ is large. To show that

$$F[G_s^{-1}(1 - \alpha)] = 1 - \frac{\alpha}{s} , \tag{9.36}$$

note that the $T$'s are uniformly distributed on (0,1) so that (9.36) becomes

$$G_s\left(1 - \frac{\alpha}{s}\right) = 1 - \alpha .$$

Now

$$1 - G_s\left(1 - \frac{\alpha}{s}\right) = P\left\{\text{at least one } T_i > 1 - \frac{\alpha}{s}\right\} = P\left\{\text{at least one } X_i < \frac{\alpha}{s}\right\} .$$

But the events $\{X_i < \alpha/s\}$ are mutually exclusive, and therefore

$$P\left\{\text{at least one } X_i < \frac{\alpha}{s}\right\} = \sum_{i=1}^{s} P\left\{X_i < \frac{\alpha}{s}\right\} = s \cdot \frac{\alpha}{s} = \alpha ,$$

which implies (9.36). ■
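
The two facts used in Example 9.2.1 can be checked numerically. The following Python sketch is only an illustration and not part of the original development: the function names and the grid size are assumptions made here. It scans $U$ over an equally spaced grid, using exact rational arithmetic, and records how many of the $Y$'s fall below $\alpha/s$ at each grid point. The random permutation is omitted because it does not change the event that at least one variable falls below $\alpha/s$.

```python
from fractions import Fraction


def shifted_uniforms(u, s):
    """Return Y_1, ..., Y_s with Y_j = u + (j - 1)/s (mod 1)."""
    return [(u + Fraction(j, s)) % 1 for j in range(s)]


def check_example(s, alpha, grid=1000):
    """Scan U over the midpoints of an equally spaced grid on (0, 1).

    Returns a pair: the largest number of Y's below alpha/s at any grid
    point, and the fraction of grid points at which at least one Y is
    below alpha/s.
    """
    cutoff = alpha / s
    most_hits, points_with_hit = 0, 0
    for k in range(grid):
        u = Fraction(2 * k + 1, 2 * grid)  # midpoint of the k-th grid cell
        hits = sum(1 for y in shifted_uniforms(u, s) if y < cutoff)
        most_hits = max(most_hits, hits)
        if hits > 0:
            points_with_hit += 1
    return most_hits, Fraction(points_with_hit, grid)


most_hits, prob = check_example(s=5, alpha=Fraction(1, 20))
print(most_hits, prob)
```

With $s = 5$ and $\alpha = 1/20$, at most one of the $Y$'s is below $\alpha/s$ at any grid point (the events are mutually exclusive), and the fraction of grid points with such an event equals $\alpha$, in agreement with (9.36).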

We shall now briefly sketch the corresponding development for $\alpha_2$, defined by
the fact that $\hat p_{(2)} \le \alpha_2$ is equivalent to $T_{(s-1)} \ge C_2$, where $C_2$ is determined by
(9.16) so that

$$G_{s-1}(C_2) = 1 - \alpha .$$

Note that $G_{s-1}$ is not the distribution of $T_{(s-1)}$, i.e., of the 2nd largest of $s$ $T$'s,
but of the largest of $T_1, \ldots, T_{s-1}$ (i.e., the largest of $s - 1$ $T$'s). In exact analogy
with the derivation of (9.32) it now follows that

$$1 - \alpha_2 = F[G_{s-1}^{-1}(1 - \alpha)] . \tag{9.37}$$

The maximum value of $\alpha_2$, as in the case of $\alpha_1$, is equal to $\alpha$ and is attained
when $T_1 = \cdots = T_{s-1}$.

The argument giving the lower bound shows that $\alpha_2 \ge \alpha/(s - 1)$. To show
that this value is attained, we must find an example for which

$$G_{s-1}\left(1 - \frac{\alpha}{s-1}\right) = 1 - \alpha .$$

Example 9.2.1 will serve this purpose since in that case

$$1 - G_{s-1}\left(1 - \frac{\alpha}{s-1}\right) = P\left\{\text{at least one of } T_1, \ldots, T_{s-1} > 1 - \frac{\alpha}{s-1}\right\} = \sum_{i=1}^{s-1} P\left\{X_i < \frac{\alpha}{s-1}\right\} = \alpha$$

for any $\alpha$ satisfying $\alpha/(s - 1) < 1/s$, i.e., $\alpha < (s - 1)/s$.

Continuing in this way we arrive at the following result. 

Theorem 9.2.4 (i) The step levels $\alpha_i$ defined by the procedure $D$ with critical
constants given by (9.16) and the equivalence of (9.30) and (9.31) are given by

$$1 - \alpha_i = F[G_{s-i+1}^{-1}(1 - \alpha)] , \tag{9.38}$$

where $G_j$ is the distribution of $\max(T_1, \ldots, T_j)$.

(ii) The range of $\alpha_i$ is

$$\frac{\alpha}{s - i + 1} \le \alpha_i \le \alpha . \tag{9.39}$$

Furthermore, the upper bound $\alpha$ is attained when $T_1 = \cdots = T_s$, i.e., when there
really is no multiplicity. The lower bound $\alpha/(s - i + 1)$ is attained when the
distribution of $T_1, \ldots, T_{s-i+1}$ is that of Example 9.2.1.

Not all points in the $s$-dimensional rectangle (9.39) are possible for $(\alpha_1, \ldots, \alpha_s)$.
In particular, since for all $t$

$$G_i(t) \ge G_j(t) \quad \text{when } i < j ,$$

it follows that

$$\alpha_1 \le \alpha_2 \le \cdots \le \alpha_s . \tag{9.40}$$

The values of $\alpha_i$ given by (9.38) can be determined when the joint distribution
of $(T_1, \ldots, T_s)$ (and hence the distributions $G_j$) is known. Consider, however,
the situation in which the common marginal distribution $F$ of the statistics $T_i$
needed to carry out the tests of the individual hypotheses $H_i$ at a given level is
known, but the joint distribution of the $T$'s is unknown. Then, we are unable to
determine the step levels (9.38).

It follows, however, from (9.39) that the procedure (9.31) with

$$\alpha_i = \alpha/(s - i + 1) \quad \text{for } i = 1, \ldots, s \tag{9.41}$$

will control the FWER for all joint distributions of $(T_1, \ldots, T_s)$, since these levels
are conservative in all cases. This is just the Holm procedure of Theorem 9.1.2.

Also, none of the levels $\alpha_i$ can be larger than $\alpha/(s - i + 1)$ without violating
the FWER condition for some distribution. To see this, note that if levels $\alpha_i$
are used in Example 9.2.1, it follows from the discussion of this example that
when $i$ of the hypotheses are true, the probability of at least one false rejection
is $(s - i + 1)\alpha_i$. Thus, if $\alpha_i$ exceeds $\alpha/(s - i + 1)$, the FWER condition will be
violated.
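
The stepdown rule (9.31) with the levels (9.41) is simple to carry out once the p-values are available. The following Python sketch is an illustration only; the function name `holm_rejections` and the sample p-values are assumptions made here, and the boundary convention (rejecting when the p-value is less than or equal to the step level) is immaterial for continuous p-values.

```python
def holm_rejections(p_values, alpha):
    """Indices rejected by the stepdown rule (9.31) with the step levels
    alpha_i = alpha / (s - i + 1) of (9.41), in order of rejection."""
    s = len(p_values)
    # Visit the hypotheses from the smallest p-value to the largest.
    order = sorted(range(s), key=lambda j: p_values[j])
    rejected = []
    for step, j in enumerate(order, start=1):
        level = alpha / (s - step + 1)
        if p_values[j] <= level:
            rejected.append(j)
        else:
            break  # first acceptance: all remaining hypotheses are accepted
    return rejected


p = [0.30, 0.004, 0.020, 0.011]  # hypothetical p-values
print(holm_rejections(p, alpha=0.05))
```

For these four p-values and $\alpha = 0.05$ the successive step levels are $\alpha/4$, $\alpha/3$, $\alpha/2$, $\alpha$; the three smallest p-values pass their levels and the procedure stops at the fourth.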

Of course, if the class of joint distributions of the $T$'s is restricted, the range of
$\alpha_i$ may be smaller than (9.39). For example, suppose that the $T$'s are independent.
Then, putting $u = G_s^{-1}(1 - \alpha)$ as before, we see from (9.34) that

$$1 - \alpha_1 = F(u) \quad \text{and} \quad 1 - \alpha = F^s(u)$$

so that

$$\alpha_1 = 1 - (1 - \alpha)^{1/s} ,$$

and more generally that

$$\alpha_i = 1 - (1 - \alpha)^{1/(s-i+1)} .$$

In this case, the range reduces to a single point.

More interesting is the case of positive quadrant dependence when

$$G_s(u) \ge F^s(u)$$

and hence

$$1 - \alpha \ge (1 - \alpha_1)^s$$

and

$$1 - (1 - \alpha)^{1/s} \le \alpha_1 \le \alpha . \tag{9.42}$$

The bounds are sharp since the upper bound is attained when $T_1 = \cdots = T_s$ and
the lower bound is attained in the case of independence.
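
To get a feeling for the size of these ranges, the step levels can be tabulated. The Python sketch below is illustrative only; the function names are assumptions made here. For each step $i$ it prints the lower bound $\alpha/(s - i + 1)$ of (9.39), the value $1 - (1 - \alpha)^{1/(s-i+1)}$ obtained under independence, and the upper bound $\alpha$.

```python
def holm_level(alpha, s, i):
    """Lower bound of (9.39): alpha / (s - i + 1)."""
    return alpha / (s - i + 1)


def independence_level(alpha, s, i):
    """Step level when the T's are independent: 1 - (1 - alpha)^(1/(s - i + 1))."""
    return 1.0 - (1.0 - alpha) ** (1.0 / (s - i + 1))


alpha, s = 0.05, 5
for i in range(1, s + 1):
    lower = holm_level(alpha, s, i)
    indep = independence_level(alpha, s, i)
    print(i, round(lower, 5), round(indep, 5), alpha)
```

The independence levels lie only slightly above the Holm levels, so that for small $\alpha$ little is lost by using (9.41) when the joint distribution is unknown.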


9.3 The Hypothesis of Homogeneity 

The previous section dealt with situations in which each of the parameters varies
independently, so that any subset of the hypotheses $H_1, \ldots, H_s$ can be true with
the remaining ones being false. This condition is not satisfied, for example, when
the set of hypotheses is

$$H_{i,j} : \mu_i = \mu_j , \quad i < j \tag{9.43}$$

for all $\binom{s}{2}$ pairs $i < j$. Then, for instance, the set $\{H_{1,2}, H_{2,3}\}$ cannot be the
set of all true hypotheses since the truth of $H_{1,2}$ and $H_{2,3}$ implies the truth of
$H_{1,3}$. It follows from this transitivity that the set of true hypotheses constitutes
a partition of the $\mu$'s, say

$$\mu_{i_1} = \cdots = \mu_{i_r} ; \quad \mu_{i_{r+1}} = \cdots = \mu_{i_{r+k}} ; \ \ldots \tag{9.44}$$

All pairs within a set of the partition are equal, and two $\mu$'s in different sets are
unequal. We shall therefore use the statement $\mu_{i_1} = \cdots = \mu_{i_r}$ as shorthand for
the statement that all hypotheses $H_{k,l}$ with $(k, l)$ any pair of subscripts from the
set $\{i_1, \ldots, i_r\}$ are true.

Unfortunately, the results of the tests of the hypotheses (9.43) do not share this
simple structure since it is possible to accept $H_{1,2} : \mu_1 = \mu_2$ and $H_{2,3} : \mu_2 = \mu_3$
while rejecting $H_{1,3} : \mu_1 = \mu_3$. We shall return to this point at the end of the
section.

We shall now consider the simultaneous testing of the $\binom{s}{2}$ hypotheses (9.43)
by means of a Holm type stepdown procedure, as in the preceding section. We
assume that statistics $T_{i,j}$ are available for testing the individual hypotheses $H_{i,j}$.
In the case of normal variables with sample means $\bar X_i$ and common variance $\sigma^2$,
these would be the statistics $T_{i,j} = |\bar X_i - \bar X_j|/\hat\sigma$. The procedure begins with
the largest of the $T$'s corresponding to the pair $(i, j)$ with the largest difference
$|\bar X_i - \bar X_j|$. This would be tested at level $\alpha/\binom{s}{2}$, since $\binom{s}{2}$ is the total number of
hypotheses being tested. If this hypothesis is accepted, all the hypotheses (9.43)
are accepted and the procedure is terminated. In the contrary case, we next test
the second largest of the $T$'s at level $\alpha/\left(\binom{s}{2} - 1\right)$, and so on. By Theorem 9.1.2,
this procedure controls the FWER, regardless of the joint distribution of the $T_{i,j}$.

However, the fact that the parameters $\theta_{i,j} = \mu_i - \mu_j$ do not vary independently
but are subject to certain logical restrictions enables us to do better. To illustrate
the situation, suppose that $s = 6$. Let

$$\bar X_{(1)} \le \cdots \le \bar X_{(6)}$$

denote the ordered values of the sample means, and let $\mu_{(i)}$ be the mean corresponding
to $\bar X_{(i)}$. At the first stage, we test $\mu_{(1)} = \mu_{(6)}$. If $(\bar X_{(6)} - \bar X_{(1)})/\hat\sigma < C$, we
accept all the hypotheses $H_{i,j}$ and terminate the procedure. If $(\bar X_{(6)} - \bar X_{(1)})/\hat\sigma \ge C$,
we reject the hypothesis $\mu_{(1)} = \mu_{(6)}$ and test the largest of the differences
$\bar X_{(6)} - \bar X_{(2)}$ and $\bar X_{(5)} - \bar X_{(1)}$.

Let us now express the rule in terms of the p-values. By (9.28),

$$\hat p_{i,j} = 1 - F(T_{i,j}) , \tag{9.45}$$

where $F$ is the distribution of $|\bar X_i - \bar X_j|/\hat\sigma$, and the rejection region
$|\bar X_{(6)} - \bar X_{(1)}|/\hat\sigma \ge C$ becomes $\hat p_{1,6} \le \alpha/\binom{6}{2}$. If the next largest difference is
$(\bar X_{(5)} - \bar X_{(1)})/\hat\sigma$, say, we would at the next step compare $1 - F[(\bar X_{(5)} - \bar X_{(1)})/\hat\sigma]$ with
$\alpha/\left(\binom{6}{2} - 1\right)$, and so on.

However, using the relations between the differences $|\bar X_j - \bar X_i|$, we can in the
present situation do considerably better than that.
To see this, consider the case where one hypothesis is false, say $\mu_1 \ne \mu_4$. Then,
$\mu_2$ cannot be equal to both $\mu_1$ and $\mu_4$; thus, one of the hypotheses $\mu_1 = \mu_2$ or
$\mu_2 = \mu_4$ must be false, and similarly for $\mu_3$, $\mu_5$ and $\mu_6$. Therefore, at step 2 when
one hypothesis is false, at least 5 must be false, and the number of possible true
hypotheses is not $\binom{6}{2} - 1 = 14$ but instead is $\binom{6}{2} - 5 = 10$.

An argument similar to that of Theorem 9.1.2 shows that at the second step of
the Holm procedure, we can increase $\alpha/14$ to $\alpha/10$ without violating the FWER.
Indeed, suppose that at least one hypothesis is false, and so at most 10 are true.
Let $I$ be the set of pairs $(i, j)$ for which the hypothesis $H_{i,j}$ is true, and let

$$\hat p_{\min} = \min\{\hat p_{i,j} : (i, j) \in I\} .$$

Then, if a false rejection occurs, it occurs at step 1 or step 2, but in either case,
it must be that $\hat p_{\min} \le \alpha/10$. But, by Bonferroni,

$$P\left\{\hat p_{\min} \le \frac{\alpha}{10}\right\} \le \sum_{(i,j) \in I} P\left\{\hat p_{i,j} \le \frac{\alpha}{10}\right\} \le |I| \cdot \frac{\alpha}{10} \le \alpha .$$

Similar improvements are possible at the succeeding steps.

As pointed out at the beginning of the section, each set of true hypotheses
(9.44) corresponds to a partition of the integers $\{1, \ldots, s\}$ and determines the
corresponding number of possible true hypotheses

$$\binom{r}{2} + \binom{k}{2} + \cdots ,$$

the sum, over the sets of the partition, of the number of pairs within each set.
The following table, adapted from Shaffer (1986), where this improvement was
first proposed, shows for $s = 3$ to 10 the possible numbers of true
hypotheses.


Table 9.1. Possible Number of True Hypotheses

  s    Total # of Hypotheses H_{i,j}    Possible Number of True Hypotheses
  3     3                               0, 1, 3
  4     6                               0-3, 6
  5    10                               0-4, 6, 10
  6    15                               0-4, 6, 7, 10, 15
  7    21                               0-7, 9, 10, 11, 15, 21
  8    28                               0-13, 15, 16, 21, 28
  9    36                               0-13, 15, 16, 18, 21, 22, 28, 36
 10    45                               0-18, 20, 21, 22, 24, 28, 29, 36, 45


Here, for example, the entries 0-4, 6, 10 for $s = 5$ correspond to the numbers
of possible true pairs $\mu_i = \mu_j$ for the given partitions. Thus, the case $\mu_1 = \mu_2 =
\mu_3 = \mu_4 = \mu_5$ corresponds to the partition $\{\mu_1, \ldots, \mu_5\}$ and allows $\binom{5}{2} = 10$
true pairs $\mu_i = \mu_j$. The case $\mu_1 \ne \mu_2 = \mu_3 = \mu_4 = \mu_5$ corresponds to the
partition $\{\mu_1\}$, $\{\mu_2, \mu_3, \mu_4, \mu_5\}$ and allows $\binom{4}{2} = 6$ true pairs $\mu_i = \mu_j$. The case
$\mu_1 = \mu_2 \ne \mu_3 = \mu_4 = \mu_5$ corresponds to the partition $\{\mu_1, \mu_2\}$, $\{\mu_3, \mu_4, \mu_5\}$ and
allows $\binom{2}{2} + \binom{3}{2} = 4$ true pairs, and so on.

The reductions are substantial. At the second step, for example, $\binom{s}{2} - 1 =
\frac{(s-2)(s+1)}{2}$ is decreased to $\binom{s-1}{2} = \frac{(s-2)(s-1)}{2}$; the difference is $s - 2$ and hence
tends to $\infty$ as $s \to \infty$.
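
The entries of Table 9.1 can be recomputed by enumerating the partitions of $s$ and summing the number of pairs within each set. The following Python sketch is an illustration under that reading of the table; the function names are assumptions made here, and this brute-force enumeration is not Shaffer's algorithm.

```python
from math import comb


def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples of positive integers."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def possible_true_counts(s):
    """Sorted values of sum_i C(v_i, 2) over all partitions (v_1, ..., v_r) of s."""
    return sorted({sum(comb(v, 2) for v in part) for part in partitions(s)})


for s in range(3, 8):
    print(s, comb(s, 2), possible_true_counts(s))
```

For $s = 5$, for instance, the partition (3, 2) contributes $\binom{3}{2} + \binom{2}{2} = 4$, and no partition yields 5, 7, 8 or 9 true hypotheses.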

Shaffer gave a simple algorithm for finding the maximum number of true
hypotheses given that $i$ hypotheses have been declared false. Use of the procedure
based on these numbers has been called $S_1$ (Donoghue (2004)). A more powerful
procedure, called $S_2$, uses the maximum number of true hypotheses given the
particular hypotheses that have been declared false. A difficulty with the $S_2$
procedure, particularly when $s$ gets large, is to determine the maximum numbers of
true hypotheses that are possible at any given step. An algorithm to deal with
this problem has been developed by Donoghue (2004).

Like the Holm procedure itself, this modification only utilizes the marginal
distributions of the statistics $T_{i,j} = |\bar X_i - \bar X_j|/\hat\sigma$, which are proportional to
t-statistics. However, under the assumption of normality, the joint distribution of
these statistics is also known, and so the levels (9.38) could be used - with $s - i + 1$
replaced by the number of true hypotheses possible at this stage - to achieve a
further improvement. Note, however, that this can be difficult because the set
of possible true hypotheses is not unique, so a number of joint distributions
would have to be determined. An alternative approach that incorporates logical
constraints and dependence among the test statistics is described in Westfall
(1997).

Multiple comparison procedures, many of them going back to the 1950's,
employ not only tests based on ranges, but also the corresponding procedures based
on F-tests. Most of them are special cases of a general class of stagewise
stepdown procedures which we shall now consider for testing homogeneity of $s$ normal
populations with common variance based on samples of equal size $n_i = n$.

For this purpose, we require a slight shift of point of view. The hypothesis
$H : \mu_{i_1} = \cdots = \mu_{i_r}$ was previously considered as shorthand for the hypothesis
that all pairs within this set are equal, and the problem as that of testing these $\binom{r}{2}$
separate hypotheses. Now we shall also admit the more traditional interpretation
of $H$ as a hypothesis in its own right for which a global test such as an F-test might
be appropriate. It should be emphasized that, logically, the two interpretations
are of course equivalent; they differ only in the way they are analyzed.

The first step in the class of procedures to be considered is to test the
hypothesis

$$H_s : \mu_1 = \cdots = \mu_s \tag{9.46}$$

either with a range test or an F-test at a critical value $C_s$ corresponding to some
level $\alpha_s$. In case of acceptance, the means are judged to exhibit no significant
differences, the set $\{\mu_1, \ldots, \mu_s\}$ is declared homogeneous, and the procedure
terminates. If $H_s$ is rejected, a search for the source of the differences is initiated
by proceeding to the second stage, which consists in testing the $s$ hypotheses

$$H_{s-1,i} : \mu_1 = \cdots = \mu_{i-1} = \mu_{i+1} = \cdots = \mu_s$$

each by means of a range or an F-test at a common critical value $C_{s-1}$ corresponding
to a common level $\alpha_{s-1}$. For any hypothesis that is accepted, the associated set
of means (and all of its subsets) are judged not to have shown any significant
differences and are not tested further. For any rejected hypothesis, the $s - 1$
subsets of size $s - 2$ are tested (except those that are subsets of an $(s - 1)$-set
whose homogeneity has been accepted). At stage $i$, the $k = s - i + 1$ differences
would be tested for all subsets that are not included in an $(s - i + 2)$-set whose
homogeneity has been accepted. Moreover, assume that all tests at stage $i$ are
performed at the same level, and denote this level by $\alpha_k$, corresponding to a
critical value $C_k$. The procedure is continued in this way until no hypotheses are
left to be tested.

To see the relation of this stagewise procedure to the fully sequential approach
described at the beginning of the section which is based on the ordered differences
$|\bar X_i - \bar X_j|/\hat\sigma$, let us compare the two procedures when all the tests of the stagewise
procedure are based on standardized ranges. In both cases the first step is based
on $|\bar X_{(s)} - \bar X_{(1)}|/\hat\sigma$ and accepts the homogeneity of $\{\mu_1, \ldots, \mu_s\}$ if this statistic is
less than some constant $C_s$. The stagewise procedure next compares the two subranges

$$|\bar X_{(s)} - \bar X_{(2)}|/\hat\sigma \quad \text{and} \quad |\bar X_{(s-1)} - \bar X_{(1)}|/\hat\sigma$$

with a common critical value $C_{s-1}$. Note, however, that if the larger of the two
is $< C_{s-1}$, this will a fortiori be true of the smaller one. This second step could
thus equally well be described as comparing the second largest of the ranges
$|\bar X_i - \bar X_j|/\hat\sigma$ with $C_{s-1}$, and in case of acceptance terminating the procedure. In
case of rejection, we would next compare the smaller of the two $(s - 1)$-ranges
with $C_{s-1}$. Continuing in this way, $C_i$ would be used to test all eligible $i$-ranges.

The fully sequential procedure described at the beginning of the section also
would terminate at the second step if the larger of the two $(s - 1)$-ranges is too
small. But if it is large enough for rejection, the next step would differ in two
ways: (i) the critical level would be lowered further; (ii) the next test statistic
would be the 3rd-largest of the differences $|\bar X_i - \bar X_j|/\hat\sigma$, which may but need not
coincide with the smaller of the $(s - 1)$-ranges. Thus, the two procedures differ
slightly, although they are very much in the same spirit.

To complete the description of a stagewise procedure, once the test statistics
have been chosen, it is necessary to specify the critical values $C_2, \ldots, C_s$ for
the successive stages or equivalently the levels $\alpha_2, \ldots, \alpha_s$ at which the tests are
performed. Note that there is no $\alpha_1$ or $C_1$ since at the $s$th stage only singlets are
left, and hence there are no longer any hypotheses to be tested.

Before discussing the best choice of a’s let us consider some specific methods 
that have been proposed in the literature. Additional properties and uses of some 
of these will be mentioned at the end of the section. 

(i) Tukey's T-method. This procedure employs the Studentized range test at
each stage with a common critical value $C_k = C$ for all $k$. The method has
an unusual feature which makes it particularly simple to apply. In general, in
order to determine whether a particular subset $S_0$ of means should be called
nonhomogeneous, it is necessary to proceed stagewise since the homogeneity of
$S_0$ itself is not tested unless homogeneity has been rejected for all sets containing
$S_0$. However, with Tukey's T-method it is only necessary to test $S_0$ itself. If the
Studentized range of $S_0$ exceeds $C$, so will that of any set containing $S_0$, and $S_0$
is declared nonhomogeneous. In the contrary case, homogeneity of $S_0$ is accepted.
The two facts which jointly eliminate the need for a stagewise procedure in this
case are (a) that the range, and hence the Studentized range, of $S_0$ cannot exceed
that of any set $S$ containing $S_0$, and (b) the constancy of the critical value. The
next method applies this idea to a procedure based on F-tests.

(ii) Gabriel's simultaneous test procedure. F-statistics do not have property
(a) above. However, this property is possessed by the statistics $\nu F$, where $\nu$ is
the number of numerator degrees of freedom (Problem 9.17). Hence a procedure
based on F-statistics with critical values $C_k = C/(k - 1)$ satisfies both (a) and
(b), since $k - 1$ is the number of numerator degrees of freedom when $k$ means are
tested, that is, at the $(s - k + 1)$st stage. This procedure, which in this form was
proposed by Gabriel (1964), permits the testing of many additional hypotheses
and when these are included becomes Scheffé's S-method, which will be discussed
in Sections 9.4 and 9.5.

(iii) Fisher's least-significant-difference method. This procedure employs an
F-test at the first stage, and Studentized range tests with a common critical value
$C_2 = \cdots = C_{s-1}$ at all succeeding stages. The constants $C_s$ and $C_2$ are related by
the fact that the first stage F-test and the pairwise t-test of the last stage have
the same level.

The usual descriptions of (i) and (iii) consider only the first and last stages 
of these procedures, and omit the conclusions which can be drawn from the 
intermediate stages. 

Several classes of procedures have been defined by prescribing the significance
levels $\alpha_k$, which can then be applied to the chosen test statistics at each stage.
Examples are:

(iv) The Newman-Keuls levels:

$$\alpha_k = \alpha .$$

(v) The Duncan levels:

$$\alpha_k = 1 - \gamma^{k-1} .$$

(vi) The Tukey levels:

$$\alpha_k = \begin{cases} 1 - \gamma^{k/2} & \text{for } 1 < k < s - 1 , \\ 1 - \gamma^{s/2} & \text{for } k = s - 1, s . \end{cases}$$

In both (v) and (vi), $\gamma = 1 - \alpha_2$.

Most of the above methods and some others are reviewed in the books by
Hochberg and Tamhane (1987) and Hsu (1996).

Let us now consider the choice of the levels $\alpha_k$ more systematically. For this
purpose, denote the probability of at least one false rejection, that is, of
rejecting homogeneity of at least one set of $\mu$'s which in fact is homogeneous, by
$\alpha(\mu_1, \ldots, \mu_s)$. As before we impose the restriction that the FWER should not
exceed $\alpha$, so that

$$\alpha(\mu_1, \ldots, \mu_s) \le \alpha \quad \text{for all } (\mu_1, \ldots, \mu_s) . \tag{9.47}$$

In order to study the best choice of $\alpha_2, \ldots, \alpha_s$ subject to (9.47), let us begin by
assuming $\sigma^2$ to be known, say $\sigma^2 = 1$. Then the F-tests are replaced by $\chi^2$-tests
and the Studentized range tests by range tests; the latter reject when the range
of the subgroup being tested is too large.

To evaluate the maximum of the left side of (9.47), suppose that the $\mu$'s fall
into $r$ distinct subgroups of sizes $v_1, \ldots, v_r$ ($\sum v_i = s$), say

$$\mu_{i_1} = \cdots = \mu_{i_{v_1}} ; \quad \mu_{i_{v_1+1}} = \cdots = \mu_{i_{v_1+v_2}} ; \ \ldots , \tag{9.48}$$

where $(i_1, \ldots, i_s)$ is a permutation of $(1, \ldots, s)$. Then, both $\chi^2$ and range statistics
for testing the $r$ hypotheses

$$H_1' : \mu_{i_1} = \cdots = \mu_{i_{v_1}} ; \quad H_2' : \mu_{i_{v_1+1}} = \cdots = \mu_{i_{v_1+v_2}} ; \ \ldots \tag{9.49}$$

are independent. The following result then gives conditions on the individual
levels $\alpha_i$ so that the FWER is controlled.


Lemma 9.3.1 If the test statistics for testing the $r$ hypotheses (9.49) are
independent, then the sup of $\alpha(\mu_1, \ldots, \mu_s)$ over all $(\mu_1, \ldots, \mu_s)$ satisfying (9.48) is
given by

$$\sup \alpha(\mu_1, \ldots, \mu_s) = 1 - \prod_{i=1}^{r} (1 - \alpha_{v_i}) , \tag{9.50}$$

where $\alpha_1 = 0$.

Proof. Since false rejection can occur only when at least one of the hypotheses
(9.49) is rejected,

$$\alpha(\mu_1, \ldots, \mu_s) \le P(\text{rejecting at least one } H_i')$$

$$= 1 - P(\text{accepting all the } H_i')$$

$$= 1 - \prod_{i=1}^{r} (1 - \alpha_{v_i}) ,$$

the last equality following from the assumption of independence.

To see that the upper bound (9.50) is sharp, let the distances between the
different groups of means (9.48) all tend to infinity. Then the probability of
accepting homogeneity of any set containing $\{\mu_{i_1}, \ldots, \mu_{i_{v_1}}\}$ as a proper subset,
and therefore not reaching the stage at which $H_1'$ is tested, tends to zero. The
same is true for $H_2', \ldots, H_r'$, and hence $\alpha(\mu_1, \ldots, \mu_s)$ tends to the right side of
(9.50). ■

It is interesting to note that $\sup \alpha(\mu_1, \ldots, \mu_s)$ depends only on $\alpha_2, \ldots, \alpha_s$ and
not on whether $\chi^2$- or range statistics are used at the various stages. In fact,
Lemma 9.3.1 remains true for many other statistics (Problem 9.18).

It follows from Lemma 9.3.1 that a procedure with levels $(\alpha_2, \ldots, \alpha_s)$ satisfies
(9.47) if and only if

$$\prod_{i=1}^{r} (1 - \alpha_{v_i}) \ge 1 - \alpha \quad \text{for all } (v_1, \ldots, v_r) \text{ with } \sum v_i = s . \tag{9.51}$$
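
Condition (9.51) can be checked for any proposed set of levels by running through the partitions of $s$. The Python sketch below is an illustration only; the function names and the dictionary representation of the levels are assumptions made here. It computes the largest value of the right side of (9.50) over all $(v_1, \ldots, v_r)$, with $\alpha_1 = 0$, and applies it to the Newman-Keuls choice $\alpha_k = \alpha$.

```python
def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples of positive integers."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def worst_case_fwer(levels, s):
    """Maximum over partitions (v_1, ..., v_r) of s of 1 - prod_i (1 - alpha_{v_i}).

    `levels` maps k = 2, ..., s to alpha_k; alpha_1 is taken to be 0.
    """
    worst = 0.0
    for part in partitions(s):
        accept_all = 1.0
        for v in part:
            accept_all *= 1.0 - (levels[v] if v >= 2 else 0.0)
        worst = max(worst, 1.0 - accept_all)
    return worst


alpha = 0.05
for s in (3, 4, 6):
    newman_keuls = {k: alpha for k in range(2, s + 1)}
    print(s, worst_case_fwer(newman_keuls, s))
```

For $s = 3$ the maximum equals $\alpha$, while for $s = 4$ the configuration $v_1 = v_2 = 2$ gives $1 - (1 - \alpha)^2 > \alpha$, so that these levels then fail (9.51).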

To see how to choose $\alpha_2, \ldots, \alpha_s$, subject to (9.47) or (9.51), let us say that
$(\alpha_2, \ldots, \alpha_s)$ is inadmissible if there exists another set of levels $(\alpha_2', \ldots, \alpha_s')$
satisfying (9.51) and such that

$$\alpha_i \le \alpha_i' \quad \text{for all } i, \text{ with strict inequality for some } i . \tag{9.52}$$

These inequalities imply that the procedure with the levels $\alpha_i'$ has uniformly
better chance of detecting existing inhomogeneities than the procedure with levels
$\alpha_i$. The definition is thus in the spirit of $\alpha$-admissibility discussed in Chapter 6,
Section 6.7.
Lemma 9.3.2 Under the assumptions of Lemma 9.3.1, necessary conditions for
$(\alpha_2, \ldots, \alpha_s)$ to be admissible are

(i) $\alpha_2 \le \cdots \le \alpha_s$ and

(ii) $\alpha_s = \alpha_{s-1} = \alpha$.

Proof. (i) Suppose to the contrary that there exists $k$ such that $\alpha_{k+1} < \alpha_k$, and
consider the procedure in which $\alpha_i' = \alpha_i$ for $i \ne k + 1$ and $\alpha_{k+1}' = \alpha_k$. To show
that the FWER of this procedure is $\le \alpha$, we need only show that $\prod (1 - \alpha_{v_i}') \ge 1 - \alpha$ for all $(v_1, \ldots, v_r)$. If
none of the $v$'s is equal to $k + 1$, then $\alpha_{v_i}' = \alpha_{v_i}$ for all $i$, and the result follows.
Otherwise replace each $v$ that is equal to $k + 1$ by two $v$'s, one equal to $k$ and
one equal to 1, and denote the resulting set of $v$'s by $w_1, \ldots, w_{r'}$. Then

$$\prod_{i=1}^{r} (1 - \alpha_{v_i}') = \prod_{i=1}^{r'} (1 - \alpha_{w_i}) \ge 1 - \alpha .$$

(ii) The left side of (9.51) involves $\alpha_s$ if and only if $r = 1$, $v_1 = s$. Thus the
only restriction on $\alpha_s$ is $\alpha_s \le \alpha$, and the only admissible choice is $\alpha_s = \alpha$. The
argument for $\alpha_{s-1}$ is analogous (Problem 9.19). ■

Part (ii) of this lemma shows that Tukey's T-method and Gabriel's simultaneous
test procedure are inadmissible since in both $\alpha_{s-1} < \alpha_s$. The same argument
shows Duncan's set of levels to be inadmissible. [These choices can however be
justified from other points of view; see for example Spjøtvoll (1974) and the
comments at the end of the section.] It also follows from the lemma that for $s = 3$
there is a unique best choice of levels, namely

$$\alpha_2 = \alpha_3 = \alpha . \tag{9.53}$$

Having fixed $\alpha_s = \alpha_{s-1} = \alpha$, how should we choose the remaining $\alpha$'s? In
order to have a reasonable chance of detecting existing inhomogeneities for all
patterns, we should like to have none of the $\alpha$'s too small. In view of part (i) of
Lemma 9.3.2, this aim is perhaps best achieved by maximizing $\alpha_2$, the level at
the last stage when individual pairs are being tested.

Lemma 9.3.3 Under the assumptions of Lemma 9.3.1, the maximum value of
$\alpha_2$ subject to (9.47) is

$$\alpha_2 = 1 - (1 - \alpha)^{[s/2]^{-1}} , \tag{9.54}$$

where $[A]$ denotes the largest integer $\le A$.

Proof. Instead of fixing $\alpha$ and maximizing $\alpha_2$, it is more convenient to fix $\alpha_2$,
say at $\alpha^*$, and then to minimize $\alpha$. The lemma will be proved by showing that
the resulting minimum value of $\alpha$ is

$$\alpha = 1 - (1 - \alpha^*)^{[s/2]} . \tag{9.55}$$

Suppose first that $s$ is even. Since $\alpha_2$ is fixed at $\alpha^*$, it follows from Lemma 9.3.1
that the right side of (9.50) can be made arbitrarily close to $\alpha$ given by (9.55).
This is seen by letting $v_1 = \cdots = v_{s/2} = 2$. When $s$ is odd, the same argument
applies if we put an additional $v$ equal to 1. ■
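
As a numerical illustration of (9.54), the following Python sketch evaluates the largest admissible pairwise level for a few values of s. The function name `max_alpha2` is ours and is introduced only for this illustration; the computation assumes the independence setting of Lemma 9.3.1.

```python
def max_alpha2(alpha: float, s: int) -> float:
    """Largest pairwise level alpha_2 permitted by (9.54) for s means."""
    if s < 2:
        raise ValueError("need at least two means")
    # [s/2] is the integer part of s/2, i.e. s // 2 in Python.
    return 1.0 - (1.0 - alpha) ** (1.0 / (s // 2))


if __name__ == "__main__":
    for s in (3, 4, 6, 10):
        print(s, max_alpha2(0.05, s))
```

For s = 3 the bound equals α itself, in agreement with (9.53); for s ≥ 4 it is strictly smaller than α.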

Lemmas 9.3.2 and 9.3.3 show that any procedure with $\alpha_s = \alpha_2$, and hence
Fisher's least-significant-difference procedure and the Newman-Keuls choice of
levels, is admissible for $s = 3$ but inadmissible for $s \ge 4$. The second of these
statements is seen from the fact that (9.47) implies

$$\alpha_2 \le 1 - (1 - \alpha)^{[s/2]^{-1}} < \alpha$$

when $s \ge 4$. The choice $\alpha_2 = \alpha_s$ thus violates Lemma 9.3.2(ii).

Once $\alpha_2$ has been fixed at the value given by (9.54), it turns out that subject
to (9.47) there exists a unique optimal choice of the remaining $\alpha$'s when $s$ is odd,
and a narrow range of choices when $s$ is even.

Theorem 9.3.1 When $s$ is odd, then $\alpha_3, \ldots, \alpha_s$ are maximized, subject to (9.47)
and (9.54), by

$$\alpha_i^* = 1 - (1 - \alpha_2)^{[i/2]} , \tag{9.56}$$

and these values can be attained simultaneously.

Proof. If we put $\gamma_i = 1 - \alpha_i$ and $\gamma = 1 - \alpha_2$, then by (9.49) and (9.56) any
procedure satisfying the conditions of the theorem must satisfy

$$\prod \gamma_{v_i} \ge \gamma^{[s/2]} = \gamma^{(s-1)/2} .$$

Let $i$ be odd, and consider any configuration in which $v_1 = i$ and all the remaining
$v$'s are equal to 2. Then

$$\gamma_i \, \gamma^{(s-i)/2} \ge \gamma^{(s-1)/2} ,$$

and hence

$$\gamma_i \ge \gamma^{(i-1)/2} = 1 - \alpha_i^* . \tag{9.57}$$

An analogous argument proves (9.56) for even $i$.

Consider now the procedure defined by (9.56). This clearly satisfies (9.54), and
it only remains to check that it also satisfies (9.47) or equivalently (9.51), and
hence that

$$\prod \gamma^{[v_i/2]} \ge \gamma^{(s-1)/2}$$

or, since $\gamma < 1$, that

$$\sum [v_i/2] \le (s-1)/2 .$$

Now $\sum [v_i/2] = (s - b)/2$, where $b$ is the number of odd $v$'s (including ones).
Since $s$ is odd, $b \ge 1$, and this completes the proof. ■
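
The final step of the proof can be checked by brute force for small odd s. The sketch below (function names are ours, chosen for illustration) builds the levels (9.56) from (9.54), enumerates every partition $(v_1, \ldots, v_r)$ of s, and evaluates the bound $1 - \prod (1 - \alpha_{v_i})$ of (9.50) with $\alpha_1 = 0$. It assumes the independence setting of Lemma 9.3.1.

```python
from math import prod


def optimal_levels(alpha, s):
    """Levels (9.56), with alpha_2 taken from (9.54); alpha_1 = 0 by convention."""
    alpha2 = 1.0 - (1.0 - alpha) ** (1.0 / (s // 2))
    levels = {1: 0.0}
    for i in range(2, s + 1):
        levels[i] = 1.0 - (1.0 - alpha2) ** (i // 2)
    return levels


def partitions(n, largest=None):
    """Yield each partition of n as a non-increasing tuple (v_1, ..., v_r)."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def worst_case_error(levels, s):
    """Maximum over all partitions of s of 1 - prod(1 - alpha_{v_i})."""
    return max(
        1.0 - prod(1.0 - levels[v] for v in part) for part in partitions(s)
    )


if __name__ == "__main__":
    lv = optimal_levels(0.05, 5)
    print(lv)
    print(worst_case_error(lv, 5))
```

With s = 5 and α = 0.05 the worst case equals α up to rounding, and raising any single level pushes some partition above α, which is the sense in which the levels are maximal.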

Note that the levels (9.56) are close to the Tukey levels (vi), which are 
admissible but do not satisfy (9.54). 

When s is even, a uniformly best choice is not available. In this case, the Tukey 
levels (vi) satisfy (9.54), are admissible, and constitute a reasonable choice. [See 
Lehmann and Shaffer (1979).] 

So far we have assumed $\sigma^2 = 1$ in order to get independence of the $r$ test
statistics used for testing the hypotheses $H'_i$, $i = 1, \ldots, r$. If $\sigma^2$ is unknown, the
$\chi^2$ and range statistics are replaced by $F$ and studentized range statistics. These
are no longer independent but are positively quadrant dependent in the sense
that

$$P\{T'_1 \le t_1, \ldots, T'_r \le t_r\} \ge \prod_{i=1}^{r} P\{T'_i \le t_i\} , \tag{9.58}$$

where $T'_1, \ldots, T'_r$ are the test statistics used for testing $H'_1, \ldots, H'_r$. This follows
from the following lemmas.

Lemma 9.3.4 Let $F_1(S), \ldots, F_r(S)$ be nondecreasing functions of a random
variable $S$. Then,

$$E\Big[\prod_{i=1}^{r} F_i(S)\Big] \ge \prod_{i=1}^{r} E[F_i(S)] , \tag{9.59}$$

provided the expectations exist.

Proof. By induction, it suffices to consider $r = 2$. To show $\mathrm{Cov}[F_1(S), F_2(S)] \ge 0$,
assume without loss of generality that $E[F_2(S)] = 0$. Let $x$ be such that
$F_2(x) = 0$; if no such $x$ exists, let $x$ be any point satisfying $F_2(y) \ge 0$ if $y > x$
and $F_2(y) \le 0$ if $y < x$. Now,

$$\mathrm{Cov}[F_1(S), F_2(S)] = E\{[F_1(S) - F_1(x)] \cdot F_2(S)\} .$$

If $S \ge x$, $F_1(S) - F_1(x) \ge 0$ and $F_2(S) \ge 0$, and so the quantity inside the
expectation is $\ge 0$. Similarly, if $S < x$, $F_1(S) - F_1(x) \le 0$ and $F_2(S) \le 0$ and so
the quantity inside the expectation is $\ge 0$. ■
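
The case r = 2 of the lemma can be illustrated with exact arithmetic on a small discrete example. In the sketch below, S is taken to be uniform on a hypothetical five-point set and the two nondecreasing functions are arbitrary choices of ours; none of these specifics come from the text.

```python
from fractions import Fraction


def covariance(values, f1, f2):
    """Exact covariance of f1(S) and f2(S) when S is uniform on `values`."""
    n = len(values)
    e1 = sum(Fraction(f1(s)) for s in values) / n
    e2 = sum(Fraction(f2(s)) for s in values) / n
    e12 = sum(Fraction(f1(s)) * Fraction(f2(s)) for s in values) / n
    return e12 - e1 * e2


S_VALUES = [-2, -1, 0, 1, 3]          # hypothetical support of S


def f1(s):
    return s ** 3                     # nondecreasing


def f2(s):
    return min(s, 1)                  # nondecreasing, flat above 1


cov_increasing = covariance(S_VALUES, f1, f2)
# Reversing the monotonicity of one function flips the sign of the covariance.
cov_mixed = covariance(S_VALUES, f1, lambda s: -f2(s))

if __name__ == "__main__":
    print(cov_increasing, cov_mixed)
```

The covariance is nonnegative when both functions are nondecreasing, as the lemma asserts for r = 2, and becomes negative once one of them is replaced by a nonincreasing function.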

Lemma 9.3.5 Assume $Y_1, \ldots, Y_r, S$ are independent, where $S$ is a nonnegative
random variable. Then, $T'_i = Y_i/S$ satisfy (9.58).

Proof. Let $G_i$ denote the distribution of $Y_i$. Fix $t_1, \ldots, t_r$. By conditioning on $S$,

$$P\{T'_1 \le t_1, \ldots, T'_r \le t_r\} = E\Big[\prod_i G_i(t_i S)\Big] .$$

Applying Lemma 9.3.4 with $F_i(s) = G_i(t_i s)$ shows that the last quantity is an upper
bound for

$$\prod_i E[G_i(t_i S)] = \prod_i P\{T'_i \le t_i\} . \;\; ■$$

For this situation, we have the following result.

Theorem 9.3.2 If the test statistics for testing the $r$ hypotheses (9.49) are
positively quadrant dependent in the sense of (9.58), then

$$\sup \alpha(\mu_1, \ldots, \mu_s) \le 1 - \prod_{i=1}^{r} (1 - \alpha_{v_i}) , \tag{9.60}$$

where, as before, $\alpha_1 = 0$.

Proof. That the right side of (9.60) is an upper bound for $\alpha(\mu_1, \ldots, \mu_s)$ follows
from the proof of Lemma 9.3.1 and the assumption of positive quadrant
dependence. ■

Note, however, that we can no longer assert that the upper bound is sharp. 
For the F and Studentized range tests, the sharp upper bound will depend on 
the total sample size n. 

Theorem 9.3.2 guarantees that the procedures using the α-levels derived under
the assumption of independence continue to control the FWER even in the case
of positive dependence. The proof of Lemma 9.3.2 shows that $\alpha_s = \alpha_{s-1} = \alpha$
continues to be necessary for admissibility even in the positively dependent case.
However, the maximization results for $\alpha_2, \ldots, \alpha_s$ can then no longer be asserted.
They nevertheless have the great advantage that they define procedures that
do not require detailed knowledge of the joint distribution of the various test
statistics.

Even in the simplified version with known variance the multiple testing problem 
considered in the present section is clearly much more difficult than the testing of 
a single hypothesis; the procedures presented above still ignore many important 
aspects of the problem. 

1. Choice of test statistic. The most obvious feature that has not been dealt 
with is the choice of test statistics. Unfortunately it does not appear that 
the invariance considerations which were so helpful in the case of a single 
hypothesis play a similar role here. 

2. Order relation of significant means. Whenever two means $\mu_i$ and $\mu_j$ are
judged to differ, we should like to state not only that $\mu_i \ne \mu_j$, but that
if $X_i < X_j$ then also $\mu_i < \mu_j$. Such additional statements introduce the
possibility of additional errors (stating $\mu_i < \mu_j$ when in fact $\mu_i > \mu_j$), and
it is not obvious that when these are included, the probability of at least
one error is still bounded by $\alpha$. [For recent work on directional errors, see
Finner (1999) and Shaffer (1990, 2002).]

3. Nominal versus true levels. The levels $\alpha_2, \ldots, \alpha_s$, sometimes called nominal
levels, are the levels at which the hypotheses $\mu_i = \mu_j$, $\mu_i = \mu_j = \mu_k, \ldots$ are
tested. They are however not the true probabilities of falsely rejecting the
homogeneity of these sets, but only the upper bounds of these probabilities
with respect to variation of the remaining $\mu$'s. The true probabilities tend
to be much smaller (particularly when $s$ is large), since they take into
account that homogeneity of a set $S_0$ is rejected only if it is also rejected
for all sets $S$ containing $S_0$.

4. Interpretability. As pointed out at the beginning of the section, the totality
of acceptance and rejection statements resulting from a multiple comparison
procedure typically does not lead to a simple partition of means. This
is illustrated by the possibility that the hypothesis of homogeneity is rejected
for a set $S$ but for none of its subsets. As another example, consider
the case $s = 3$, where it may happen that the hypotheses $\mu_i = \mu_j$ and
$\mu_j = \mu_k$ are accepted but $\mu_i = \mu_k$ is rejected. The number of such
"inconsistencies" and the corresponding difficulty of interpreting the results may
be formidable. Measures of the complexity of the totality of statements as
a third criterion (besides level and power) are discussed by Shaffer (1981).
The inconsistencies and resulting difficulties of interpretation suggest the
consideration of an alternative formulation of the problem which avoids this
difficulty. Instead of testing the $\binom{s}{2}$ hypotheses $H_{ij}: \mu_i = \mu_j$, estimate
the (unknown) partition of the $\mu$'s defined by (9.48). Possible approaches
to such procedures are discussed for example in Hochberg and Tamhane
(1987, Chapter 10, Section 6) and by Dayton (2003).

5. Procedures (i) and (ii) can be inverted to provide simultaneous confidence
intervals for all differences $\mu_j - \mu_i$. The T-method (discussed in Problems
9.29-9.32) was designed to give simultaneous intervals for all differences
$\mu_j - \mu_i$; it can be extended to cover also all contrasts in the $\mu$'s, that
is, all linear functions $\sum c_i \mu_i$ with $\sum c_i = 0$, but against more complex
contrasts the intervals tend to be longer than those of Scheffé's S-method,
which was intended for the simultaneous consideration of all contrasts. [For
a comparison of the two methods, see for example Scheffé (1959, Section
3.7) and Arnold (1981, Chapter 12).] It is a disadvantage of the remaining
(truly stagewise or sequential) procedures of this section that the problem
of corresponding confidence sets is considerably more complicated. For a
discussion of such confidence methods, see Holm (1999) and the references
cited there.

6. To control the rate of false rejections, we have restricted attention to
procedures controlling the FWER, the probability of at least one error. Instead,
one might wish to control the false discovery rate as defined at the end
of Section 9.1; see Benjamini and Hochberg (1995). Alternatively, an
optimality theory based on the number of false rejections is given in Spjøtvoll
(1972). Another possibility is to control the $k$-FWER, the probability
of making $k$ or more false rejections, as well as the probability that the
false discovery proportion exceeds some threshold; see Korn et al. (2004),
Romano and Shaikh (2004) and Lehmann and Romano (2005).

7. The optimal choice of the $\alpha_i$ discussed in this section can be further
improved, at the cost of considerable additional complication, by permitting
the $\alpha$'s to depend on the outcomes of the other tests. This possibility is
discussed, for example, in Marcus, Peritz, and Gabriel (1976); see also Holm
(1979) and Shaffer (1984).

The procedures discussed in this section were concerned with testing the
equality of means. In more complex situations, further problems arise. Consider, for
example, the two-way layout of Section 7.5 with

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} \qquad \Big(\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0\Big) .$$

If we are interested in multiple testing of the $\alpha$'s, $\beta$'s, and $\gamma$'s, the first question
that arises is whether we want to treat these three cases ($\alpha$'s, $\beta$'s, $\gamma$'s) as a single
family, as two families (the main effects forming one family, the interactions the
other), or as three families in which each of the three sets is handled separately.

The most appropriate designation of what constitutes a family depends very 
much on context. Consider, for example, the National Assessment of Educational 
Progress which makes it possible to compare the progress made by any two states. 
For a federal report, the set of all $\binom{50}{2}$ possible hypotheses would constitute an
appropriate family. However, a particular state would be interested primarily in 
the comparison of its performance with those of the other 49 states, thus leading 
to a family of size 49. A comparison which is not significant in the federal report 
might then turn out to be significant in the state report. Some of the issues 
concerning the most suitable definition of family are discussed in Tukey (1991)
and in the books by Hochberg and Tamhane (1987), and Westfall and Young
(1993).
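
The effect of the choice of family can be made concrete with a small sketch. Purely for illustration, it assumes that each family is handled by a Bonferroni-type rule (every comparison tested at level α divided by the family size); the p-value is hypothetical and the variable names are ours.

```python
from math import comb

n_states = 50
federal_family = comb(n_states, 2)   # all pairwise comparisons of states
state_family = n_states - 1          # one state against each of the others

alpha = 0.05
# Assumed rule for this illustration: per-comparison level alpha / family size.
federal_cutoff = alpha / federal_family
state_cutoff = alpha / state_family


def significant(p_value, cutoff):
    """True if the comparison is declared significant at the given cutoff."""
    return p_value <= cutoff


p = 0.0005  # hypothetical p-value for one particular pair of states

if __name__ == "__main__":
    print(federal_family, state_family)
    print(significant(p, federal_cutoff), significant(p, state_cutoff))
```

Under this assumed rule the same comparison is significant within the family of 49 hypotheses but not within the family of 1225.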

We shall in the next two sections consider simultaneous inferences for various 
families of linear functions of means in normal linear models. However, since we 
are assuming fully articulated parametric models, we shall consider the slightly 
more demanding problem of obtaining simultaneous confidence intervals rather 
than restricting attention to hypothesis testing. 

As the simplest example, suppose that $X_1, \ldots, X_s$ are normal variables with
means $\mu_1, \ldots, \mu_s$ and unit variance. We can then apply to the hypotheses
$H_i: \mu_i = \mu_{i,0}$ the approach of Section 9.1 and test these hypotheses by means of
a stepdown procedure. The resulting acceptance regions can then be converted in
the usual way into confidence sets. It is shown in Holm (1999) that these sets are
rather complicated and not rectangular, so that they do not consist of intervals
for the individual $\mu_i$'s. (They can, of course, be enclosed in a larger rectangle,
but the intervals obtained by such a process tend to be unnecessarily large.)


9.4 Scheffé's S-Method: A Special Case

If $X_1, \ldots, X_r$ are independent normal with common variance $\sigma^2$ and expectations
$E(X_i) = \alpha + \beta t_i$, confidence sets for $(\alpha, \beta)$ were obtained in Section 7.6. A related
problem is that of determining confidence bands for the whole regression line
$\xi = \alpha + \beta t$, that is, functions $L'(t; X)$, $M'(t; X)$ such that

$$P\{L'(t; X) \le \alpha + \beta t \le M'(t; X) \text{ for all } t\} = \gamma . \tag{9.61}$$

The problem of obtaining simultaneous confidence intervals for a continuum of 
parametric functions arises also in other contexts. In the present section, a general 
problem of this kind will be considered for linear models. Confidence bands for 
an unknown distribution function were treated in Section 6.13. 

Suppose first that $X_1, \ldots, X_r$ are independent normal with variance $\sigma^2 = 1$
and with means $E(X_i) = \xi_i$, and that simultaneous confidence intervals are
required for all linear functions $\sum u_i \xi_i$. No generality is lost by dividing $\sum u_i \xi_i$
and its lower and upper bound by $\sqrt{\sum u_i^2}$, so that attention can be restricted to
confidence sets

$$S(x) = \Big\{\xi : L(u; x) \le \sum u_i \xi_i \le M(u; x) \text{ for all } u \in U\Big\} , \tag{9.62}$$

where $x$, $u$ denote both the vectors with coordinates $x_i$, $u_i$ and the $r \times 1$ column
matrices with these elements, and where $U$ is the set of all $u$ with $\sum u_i^2 = 1$. The
sets $S(x)$ are to satisfy

$$P_\xi\{\xi \in S(X)\} = \gamma \quad \text{for all } (\xi_1, \ldots, \xi_r) . \tag{9.63}$$

Since $u = (u_1, \ldots, u_r) \in U$ if and only if $-u = (-u_1, \ldots, -u_r) \in U$, the
simultaneous inequalities (9.62) imply $L(-u; x) \le -\sum u_i \xi_i \le M(-u; x)$, and
hence

$$-M(-u; x) \le \sum u_i \xi_i \le -L(-u; x)$$

and

$$\max(L(u; x), -M(-u; x)) \le \sum u_i \xi_i \le \min(M(u; x), -L(-u; x)) .$$

Nothing is therefore lost by assuming that $L$ and $M$ satisfy

$$L(u; x) = -M(-u; x) . \tag{9.64}$$
The problem of determining suitable confidence bounds $L(u; x)$ and $M(u; x)$
is invariant under the group $G_1$ of orthogonal transformations

$$G_1: \quad gx = Qx, \quad \bar{g}\xi = Q\xi \qquad (Q \text{ an orthogonal } r \times r \text{ matrix}).$$

Writing $\sum u_i \xi_i = u'\xi$, we have

$$g^* S(x) = \{Q\xi : L(u; x) \le u'\xi \le M(u; x) \text{ for all } u \in U\}$$
$$= \{\xi : L(u; x) \le u'(Q^{-1}\xi) \le M(u; x) \text{ for all } u \in U\}$$
$$= \{\xi : L(Q^{-1}u; x) \le u'\xi \le M(Q^{-1}u; x) \text{ for all } u \in U\} ,$$

where the last equality uses the fact that $U$ is invariant under orthogonal
transformations of $u$.

Since

$$S(gx) = \{\xi : L(u; Qx) \le u'\xi \le M(u; Qx) \text{ for all } u \in U\} ,$$

the confidence sets $S(x)$ are equivariant under $G_1$ if and only if

$$L(u; Qx) = L(Q^{-1}u; x), \qquad M(u; Qx) = M(Q^{-1}u; x),$$

or equivalently if

$$L(Qu; Qx) = L(u; x), \qquad M(Qu; Qx) = M(u; x) \quad \text{for all } x, Q \text{ and } u \in U, \tag{9.65}$$

that is, if $L$ and $M$ are invariant under common orthogonal transformations of $u$
and $x$.

A function $L$ of $u$ and $x$ is invariant under these transformations if and only
if it depends on $u$ and $x$ only through $u'x$, $x'x$, and $u'u$ [Problem 9.23(i)] and
hence (since $u'u = 1$) if there exists $h$ such that

$$L(u; x) = h(u'x, x'x) . \tag{9.66}$$

A second group of transformations leaving the problem invariant is the group
of translations

$$G_2: \quad gx = x + a, \quad \bar{g}\xi = \xi + a ,$$

where $x + a = (x_1 + a_1, \ldots, x_r + a_r)$. An argument paralleling that leading to
(9.65) shows that $L(u; x)$ is equivariant under $G_2$ if and only if [Problem 9.23(ii)]

$$L(u; x + a) = L(u; x) + \sum a_i u_i \quad \text{for all } x, a, \text{ and } u. \tag{9.67}$$

The function $h$ of (9.66) must therefore satisfy

$$h[u'(x + a), (x + a)'(x + a)] = h(u'x, x'x) + a'u \quad \text{for all } a, x \text{ and } u \in U,$$

and hence, putting $x = 0$,

$$h(u'a, a'a) = a'u + h(0, 0) .$$

A necessary condition (which clearly is also sufficient) for $S(x)$ to be equivariant
under both $G_1$ and $G_2$ is therefore the existence of constants $c$ and $d$ such that

$$S(x) = \Big\{\xi : \sum u_i x_i - c \le \sum u_i \xi_i \le \sum u_i x_i + d \text{ for all } u \in U\Big\} .$$

From (9.64) it follows that $c = d$, so that the only equivariant families $S(x)$ are
given by

$$S(x) = \Big\{\xi : \Big|\sum u_i (x_i - \xi_i)\Big| \le c \text{ for all } u \in U\Big\} . \tag{9.68}$$

The constant $c$ is determined by (9.63), which now reduces to

$$P_0\Big\{\Big|\sum u_i X_i\Big| \le c \text{ for all } u \in U\Big\} = \gamma . \tag{9.69}$$

By the Schwarz inequality $(\sum u_i X_i)^2 \le \sum X_i^2$, since $\sum u_i^2 = 1$, and hence

$$\Big|\sum u_i X_i\Big| \le c \text{ for all } u \in U \quad \text{if and only if} \quad \sum X_i^2 \le c^2 . \tag{9.70}$$

The constant $c$ in (9.68) is therefore given by

$$P(\chi_r^2 \le c^2) = \gamma . \tag{9.71}$$

In (9.68), it is of course possible to drop the restriction $u \in U$ by writing (9.68)
in the equivalent form

$$\Big|\sum u_i (x_i - \xi_i)\Big| \le c \sqrt{\sum u_i^2} \quad \text{for all } u. \tag{9.72}$$
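
The equivalence (9.70) rests on the fact that the supremum of $|\sum u_i x_i|$ over unit vectors $u$ equals $\sqrt{\sum x_i^2}$ and is attained at $u = x / \sqrt{\sum x_i^2}$. A small numerical check, with a hypothetical vector x and function names of our own choosing:

```python
import math
import random


def sup_over_unit_vectors(x):
    """Supremum of |sum u_i x_i| over unit vectors u: the Euclidean norm of x."""
    return math.sqrt(sum(xi * xi for xi in x))


def random_unit_vector(r, rng):
    """A direction drawn uniformly from the unit sphere in r dimensions."""
    z = [rng.gauss(0.0, 1.0) for _ in range(r)]
    norm = math.sqrt(sum(zi * zi for zi in z))
    return [zi / norm for zi in z]


rng = random.Random(0)
x = [1.0, -2.0, 2.0]                      # hypothetical observation with norm 3
bound = sup_over_unit_vectors(x)

# No randomly drawn unit vector exceeds the Schwarz bound ...
largest_seen = max(
    abs(sum(u * xi for u, xi in zip(random_unit_vector(3, rng), x)))
    for _ in range(10_000)
)
# ... and the direction u = x / |x| attains it.
u_star = [xi / bound for xi in x]
attained = abs(sum(u * xi for u, xi in zip(u_star, x)))

if __name__ == "__main__":
    print(bound, largest_seen, attained)
```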


So far attention has been restricted to the confidence bands (9.62). However,
confidence sets do not have to be intervals, and it may be of interest to consider
more general simultaneous confidence sets

$$S(x): \quad \sum u_i \xi_i \in A(u, x) \quad \text{for all } u \in U. \tag{9.73}$$

For these sets, the equivariance conditions (9.65) and (9.67) become respectively
(Problem 9.24)

$$A(Qu, Qx) = A(u, x) \quad \text{for all } x, Q \text{ and } u \in U \tag{9.74}$$

and

$$A(u, x + a) = A(u, x) + u'a \quad \text{for all } u, x, \text{ and } a. \tag{9.75}$$

The first of these is equivalent to the condition that the set $A(u, x)$ depends on
$u \in U$ and $x$ only through $u'x$ and $x'x$. On the other hand putting $x = 0$ in
(9.75) gives

$$A(u, a) = A(u, 0) + u'a .$$

It follows from (9.74) that $A(u, 0)$ is a fixed set $A_1$ independent of $u$, so that

$$A(u, x) = A_1 + u'x . \tag{9.76}$$

The most general equivariant sets (under $G_1$ and $G_2$) are therefore of the form

$$\sum u_i (x_i - \xi_i) \in A \quad \text{for all } u \in U, \tag{9.77}$$

where $A = -A_1$.

We shall now suppose that $r > 1$ and then show that among all $A$ which
define confidence sets (9.77) with confidence coefficient $\ge \gamma$, the sets (9.68) are
smallest (see footnote 3) in the very strong sense that if $A_0 = [-c_0, c_0]$ denotes the set (9.68)
with confidence coefficient $\gamma$, then $A_0$ is a subset of $A$.

To see this, note that if $Y_i = X_i - \xi_i$, the sets $A$ are those satisfying

$$P\Big\{\sum u_i Y_i \in A \text{ for all } u \in U\Big\} \ge \gamma . \tag{9.78}$$

Now the set of values taken on by $\sum u_i y_i$ for a fixed $y = (y_1, \ldots, y_r)$ as $u$ ranges
over $U$ is the interval (Problem 9.24)

$$I(y) = \Big[-\sqrt{\sum y_i^2}, \; +\sqrt{\sum y_i^2}\Big] .$$

Let $c^*$ be the largest value of $c$ for which the interval $[-c, c]$ is contained in $A$.
Then the probability (9.78) is equal to

$$P\{I(Y) \subset A\} = P\{I(Y) \subset [-c^*, c^*]\} .$$

Since $P\{I(Y) \subset A\} \ge \gamma$, it follows that $c^* \ge c_0$, and this completes the proof.

It is of interest to compare the simultaneous confidence intervals (9.68) for all
$\sum u_i \xi_i$, $u \in U$, with the joint confidence spheres for $(\xi_1, \ldots, \xi_r)$ given by (6.43).
These two sets of confidence statements are equivalent in the following sense.


Theorem 9.4.1 The parameter vector $(\xi_1, \ldots, \xi_r)$ satisfies $\sum (X_i - \xi_i)^2 \le c^2$ if
and only if it satisfies (9.68).

Proof. The result follows immediately from (9.70) with $X_i$ replaced by $X_i - \xi_i$. ■

Another comparison of interest is that of the simultaneous confidence intervals
(9.72) for all $u$ with the corresponding interval

$$S'(x) = \Big\{\xi : \Big|\sum u_i (x_i - \xi_i)\Big| \le c' \sqrt{\sum u_i^2}\Big\} \tag{9.79}$$

for a single given $u$. Since $\sum u_i (X_i - \xi_i) / \sqrt{\sum u_i^2}$ has a standard normal
distribution, the constant $c'$ is determined by $P(\chi_1^2 \le c'^2) = \gamma$ instead of by (9.71). If
$r > 1$, the constant $c^2 = c_r^2$ is clearly larger than $c'^2 = c_1^2$. The lengthening of the
confidence intervals by the factor $c_r / c_1$ in going from (9.79) to (9.72) is the price
one must pay for asserting confidence $\gamma$ for all $u$ instead of a single one.
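
The constants $c_r$ of (9.71) and the lengthening factor $c_r / c_1$ are easy to tabulate. The sketch below uses only the Python standard library: the $\chi^2$ distribution function is computed from the power series of the regularized incomplete gamma function and inverted by bisection. The function names are ours, and the numerical method is one convenient choice among many, not something prescribed by the text.

```python
import math


def chi2_cdf(x, df):
    """P(chi^2_df <= x), from the series for the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def scheffe_c(gamma, r):
    """The constant c of (9.71): the solution of P(chi^2_r <= c^2) = gamma."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, r) < gamma:
        hi *= 2.0
    for _ in range(200):                   # bisection on c^2
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, r) < gamma:
            lo = mid
        else:
            hi = mid
    return math.sqrt(hi)


def lengthening_factor(gamma, r):
    """Ratio c_r / c_1 by which (9.72) is longer than (9.79)."""
    return scheffe_c(gamma, r) / scheffe_c(gamma, 1)


if __name__ == "__main__":
    for r in (1, 2, 5, 10):
        print(r, scheffe_c(0.95, r), lengthening_factor(0.95, r))
```

For r = 2 the result can be checked against the closed form $c^2 = -2 \log(1 - \gamma)$, and for r = 1 against the normal quantile.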

In (9.79), it is assumed that the vector $u$ defines the linear combination of
interest and is given before any observations are available. However, it often
happens that an interesting linear combination $\sum \hat{u}_i \xi_i$ to be estimated is suggested
by the data. The intervals

$$\Big|\sum \hat{u}_i (x_i - \xi_i)\Big| \le c \sqrt{\sum \hat{u}_i^2} \tag{9.80}$$

with $c$ given by (9.71) then provide confidence limits for $\sum \hat{u}_i \xi_i$ at confidence
level $\gamma$, since they are included in the set of intervals (9.72). [The notation $\hat{u}_i$
in (9.80) indicates that the $u$'s were suggested by the data rather than fixed in
advance.]

3 A more general definition of smallness is due to Wijsman (1979). It has been pointed
out by Professor Wijsman that his concept is equivalent to that of tautness defined by
Wynn and Bloomfield (1971).


Example 9.4.1 (Two groups) Suppose the data exhibit a natural split into
a lower and upper group, say $\xi_{i_1}, \ldots, \xi_{i_k}$ and $\xi_{j_1}, \ldots, \xi_{j_{r-k}}$, with averages $\bar{\xi}_-$
and $\bar{\xi}_+$, and that confidence limits are required for $\bar{\xi}_+ - \bar{\xi}_-$. Letting
$\bar{X}_- = (X_{i_1} + \cdots + X_{i_k})/k$ and $\bar{X}_+ = (X_{j_1} + \cdots + X_{j_{r-k}})/(r - k)$ denote the associated
averages of the $X$'s we see that

$$\bar{X}_+ - \bar{X}_- - c\sqrt{\frac{1}{k} + \frac{1}{r-k}} \le \bar{\xi}_+ - \bar{\xi}_- \le \bar{X}_+ - \bar{X}_- + c\sqrt{\frac{1}{k} + \frac{1}{r-k}} \tag{9.81}$$

with $c$ given by (9.71) provide the desired limits. Similarly

$$\bar{X}_- - \frac{c}{\sqrt{k}} \le \bar{\xi}_- \le \bar{X}_- + \frac{c}{\sqrt{k}}, \qquad \bar{X}_+ - \frac{c}{\sqrt{r-k}} \le \bar{\xi}_+ \le \bar{X}_+ + \frac{c}{\sqrt{r-k}} \tag{9.82}$$

provide simultaneous confidence intervals for the two group means separately,
with $c$ again given by (9.71). For a discussion of related examples and issues see
Peritz (1965). ■
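
The limits (9.81) and (9.82) can be computed directly once c is known. In the following sketch the data are hypothetical, the function name is ours, and c is supplied by the caller; the value 3.08 used in the example is approximately the solution of (9.71) for r = 4 and γ = 0.95.

```python
import math


def two_group_limits(lower, upper, c):
    """Limits (9.81) and (9.82) for a data-suggested split into two groups.

    `lower` and `upper` hold the observations X_i in the two groups (unit
    variance assumed); c is the Scheffe constant from (9.71).
    """
    k, m = len(lower), len(upper)          # m plays the role of r - k
    mean_lower = sum(lower) / k
    mean_upper = sum(upper) / m
    diff = mean_upper - mean_lower
    half_diff = c * math.sqrt(1.0 / k + 1.0 / m)
    return {
        "difference": (diff - half_diff, diff + half_diff),
        "lower_mean": (mean_lower - c / math.sqrt(k), mean_lower + c / math.sqrt(k)),
        "upper_mean": (mean_upper - c / math.sqrt(m), mean_upper + c / math.sqrt(m)),
    }


# Hypothetical observations, r = 4 split into two groups of two.
limits = two_group_limits([0.0, 0.2], [3.0, 3.4], c=3.08)

if __name__ == "__main__":
    for name, interval in limits.items():
        print(name, interval)
```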


Instead of estimating a data-based function $\sum \hat{u}_i \xi_i$, one may be interested in
testing it. At level $\alpha = 1 - \gamma$, the hypothesis $\sum \hat{u}_i \xi_i = 0$ is rejected when the
confidence intervals (9.80) do not cover the origin, i.e., when

$$\Big|\sum \hat{u}_i x_i\Big| > c \sqrt{\sum \hat{u}_i^2} .$$

Equivariance with respect to the group $G_1$ of orthogonal transformations
assumed at the beginning of this section is appropriate only when all linear
combinations $\sum u_i \xi_i$ with $u \in U$ are of equal importance. Suppose instead that interest
focuses on the individual means, so that simultaneous confidence intervals are
required for $\xi_1, \ldots, \xi_r$. This problem remains invariant under the translation group
$G_2$. However, it is no longer invariant under $G_1$, but only under the much smaller
subgroup $G_0$ generated by the $r!$ permutations and the $2^r$ changes of sign of the
$X$'s. The only simultaneous intervals that are equivariant under $G_0$ and $G_2$ are
given by [Problem 9.25(i)]

$$S(x) = \{\xi : x_i - \Delta \le \xi_i \le x_i + \Delta \text{ for all } i\} , \tag{9.83}$$

where $\Delta$ is determined by

$$P[S(X)] = P(\max |Y_i| \le \Delta) = \gamma \tag{9.84}$$

with $Y_1, \ldots, Y_r$ being independent $N(0, 1)$.

These maximum-modulus intervals for the $\xi$'s can be extended to all linear
combinations $\sum u_i \xi_i$ of the $\xi$'s by noting that the right side of (9.83) is equal to
the set [Problem 9.25(ii)]

$$\Big\{\xi : \Big|\sum u_i (X_i - \xi_i)\Big| \le \Delta \sum |u_i| \text{ for all } u\Big\} , \tag{9.85}$$

which therefore also has probability $\gamma$, but which is not equivariant under $G_1$. A
comparison of the intervals (9.85) with the Scheffé intervals (9.72) shows [Problem
9.25(iii)] that the intervals (9.85) are shorter when $\sum u_j \xi_j = \xi_i$ (i.e., when $u_j = 1$
for $j = i$, and $u_j = 0$ otherwise), but that they are longer for example when
$u_1 = \cdots = u_r$.
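
The comparison between (9.85) and (9.72) can be illustrated numerically for r = 2, where both constants have closed forms: by independence, (9.84) reads $(2\Phi(\Delta) - 1)^r = \gamma$, and for two degrees of freedom (9.71) gives $c^2 = -2\log(1 - \gamma)$. The sketch below is restricted to r = 2 for that reason; the function names are ours.

```python
import math
from statistics import NormalDist


def max_modulus_delta(gamma, r):
    """Delta of (9.84): solves (2*Phi(Delta) - 1)**r = gamma."""
    return NormalDist().inv_cdf((1.0 + gamma ** (1.0 / r)) / 2.0)


def scheffe_c_two_means(gamma):
    """c of (9.71) for r = 2, using P(chi^2_2 <= x) = 1 - exp(-x/2)."""
    return math.sqrt(-2.0 * math.log(1.0 - gamma))


def half_lengths(u, gamma=0.95):
    """Half-lengths of the intervals (9.85) and (9.72) for a vector u, r = 2."""
    if len(u) != 2:
        raise ValueError("this sketch covers r = 2 only")
    delta = max_modulus_delta(gamma, 2)
    c = scheffe_c_two_means(gamma)
    maximum_modulus = delta * sum(abs(ui) for ui in u)
    scheffe = c * math.sqrt(sum(ui * ui for ui in u))
    return maximum_modulus, scheffe


if __name__ == "__main__":
    print(half_lengths([1.0, 0.0]))   # a single mean
    print(half_lengths([1.0, 1.0]))   # equal weights
```

For a single mean the maximum-modulus interval is the shorter of the two; for equal weights the Scheffé interval is shorter.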
9.5 Scheffé's S-Method for General Linear Models

The results obtained in the preceding section for the simultaneous estimation of
all linear functions $\sum u_i \xi_i$ when the common variance of the variables $X_i$ is known
easily extend to the general linear model of Section 7.1. In the canonical form
(7.2), the observations are $n$ independent normal random variables with common
unknown variance $\sigma^2$ and with means $E(Y_i) = \eta_i$ for $i = 1, \ldots, s$ and $E(Y_i) = 0$
for $i = s + 1, \ldots, n$. Simultaneous confidence intervals are required for all linear
functions $\sum_{i=1}^{r} u_i \eta_i$ with $u \in U$, where $U$ is the set of all $u = (u_1, \ldots, u_r)$
with $\sum_{i=1}^{r} u_i^2 = 1$. Invariance under the translation group $Y'_i = Y_i + a_i$,
$i = r + 1, \ldots, s$, leaves $Y_1, \ldots, Y_r$; $Y_{s+1}, \ldots, Y_n$ as maximal invariants, and
sufficiency justifies restricting attention to $Y = (Y_1, \ldots, Y_r)$ and $S^2 = \sum_{j=s+1}^{n} Y_j^2$.
The confidence intervals corresponding to (9.62) are therefore of the form

$$L(u; y, S) \le \sum_{i=1}^{r} u_i \eta_i \le M(u; y, S) \quad \text{for all } u \in U, \tag{9.86}$$

and in analogy to (9.64) may be assumed to satisfy

$$L(u; y, S) = -M(-u; y, S) . \tag{9.87}$$

By the argument leading to (9.66), it is seen in the present case that
equivariance of $L(u; y, S)$ under $G_1$ requires that

$$L(u; y, S) = h(u'y, y'y, S),$$

and equivariance under $G_2$ requires that $L$ be of the form

$$L(u; y, S) = \sum_{i=1}^{r} u_i y_i - c(S) .$$

Since $\sigma^2$ is unknown, the problem is now also invariant under the group of scale
changes

$$G_3: \quad y'_i = b y_i \; (i = 1, \ldots, r), \quad S' = bS \quad (b > 0).$$

Equivariance of the confidence intervals under $G_3$ leads to the condition [Problem
9.26(i)]

$$L(u; by, bS) = b L(u; y, S) \quad \text{for all } b > 0,$$

and hence to

$$b \sum u_i y_i - c(bS) = b \Big[\sum u_i y_i - c(S)\Big] ,$$

or $c(bS) = b\,c(S)$. Putting $S = 1$ shows that $c(S)$ is proportional to $S$. Thus

$$L(u; y, S) = \sum u_i y_i - cS, \qquad M(u; y, S) = \sum u_i y_i + dS,$$

and by (9.87), $c = d$, so that the equivariant simultaneous intervals are given by

$$\sum u_i y_i - cS \le \sum u_i \eta_i \le \sum u_i y_i + cS \quad \text{for all } u \in U. \tag{9.88}$$

Since (9.88) is equivalent to

$$\frac{\sum (y_i - \eta_i)^2}{S^2} \le c^2 ,$$

the constant $c$ is determined from the $F$-distribution by

$$P_0\Big\{\frac{\sum Y_i^2 / r}{S^2 / (n - s)} \le \frac{n - s}{r} c^2\Big\} = P\Big\{F_{r, n-s} \le \frac{n - s}{r} c^2\Big\} = \gamma . \tag{9.89}$$

As in (9.72), the restriction $u \in U$ can be dropped; this only requires replacing $c$
in (9.88) and (9.89) by $c\sqrt{\sum u_i^2} = c\sqrt{\mathrm{Var}(\sum u_i Y_i) / \sigma^2}$.

As in the case of known variance, instead of restricting attention to the
confidence bands (9.88), one may wish to permit more general simultaneous confidence
sets

$$\sum u_i \eta_i \in A(u; y, S) . \tag{9.90}$$

The most general equivariant confidence sets are then of the form [Problem
9.26(ii)]

$$\frac{\sum u_i (y_i - \eta_i)}{S} \in A \quad \text{for all } u \in U, \tag{9.91}$$

and for a given confidence coefficient, the set $A$ is minimized by $A_0 = [-c, c]$, so
that (9.91) reduces to (9.88).

For applications, it is convenient to express the intervals (9.88) in terms of
the original variables $X_i$ and $\xi_i$. Suppose as in Section 7.1 that $X_1, \ldots, X_n$ are
independently distributed as $N(\xi_i, \sigma^2)$, where $\xi = (\xi_1, \ldots, \xi_n)$ is assumed to lie
in a given $s$-dimensional linear subspace $\Pi_\Omega$ ($s < n$). Let $V$ be an $r$-dimensional
subspace of $\Pi_\Omega$ ($r \le s$), let $\hat{\xi}_i$ be the least squares estimates of the $\xi$'s under
$\Pi_\Omega$, and let $S^2 = \sum (X_i - \hat{\xi}_i)^2$. Then the inequalities

$$\sum v_i \hat{\xi}_i - cS\sqrt{\frac{\mathrm{Var}(\sum v_i \hat{\xi}_i)}{\sigma^2}} \le \sum v_i \xi_i \le \sum v_i \hat{\xi}_i + cS\sqrt{\frac{\mathrm{Var}(\sum v_i \hat{\xi}_i)}{\sigma^2}} \quad \text{for all } v \in V, \tag{9.92}$$

with $c$ given by (9.89), provide simultaneous confidence intervals for $\sum v_i \xi_i$ for
all $v \in V$ with confidence coefficient $\gamma$.

This result is an immediate consequence of (9.88) and (9.89) together with the
following three facts, which will be proved below:

(i) If $\sum_{i=1}^{r} u_i \eta_i = \sum_{j=1}^{n} v_j \xi_j$, then $\sum_{i=1}^{r} u_i Y_i = \sum_{j=1}^{n} v_j \hat{\xi}_j$;

(ii) $\sum_{i=s+1}^{n} Y_i^2 = \sum_{j=1}^{n} (X_j - \hat{\xi}_j)^2$.

To state (iii), note that the $\eta$'s are obtained as linear functions of the $\xi$'s through
the relationship

$$(\eta_1, \ldots, \eta_r, \eta_{r+1}, \ldots, \eta_s, 0, \ldots, 0)' = C(\xi_1, \ldots, \xi_n)' , \tag{9.93}$$

where $C$ is defined by (7.1) and the prime indicates a transpose. This is seen by
taking the expectation of both sides of (7.1). For each vector $u = (u_1, \ldots, u_r)$,
(9.93) expresses $\sum u_i \eta_i$ as a linear function $\sum v_j^{(u)} \xi_j$ of the $\xi$'s.

(iii) As $u$ ranges over $r$-space, $v^{(u)} = (v_1^{(u)}, \ldots, v_n^{(u)})$ ranges over $V$.

Proof of (i) Recall from Section 7.2 that

$$\sum_{j=1}^{n} (X_j - \xi_j)^2 = \sum_{i=1}^{s} (Y_i - \eta_i)^2 + \sum_{j=s+1}^{n} Y_j^2 .$$

Since the right side is minimized by $\eta_i = Y_i$ and the left side by $\xi_j = \hat{\xi}_j$, this
shows that

$$(Y_1, \ldots, Y_s, 0, \ldots, 0)' = C(\hat{\xi}_1, \ldots, \hat{\xi}_n)' ,$$

and the result now follows from comparison with (9.93).

Proof of (ii) This is just equation (7.13).

Proof of (iii) Since $\eta_i = \sum_{j=1}^{n} c_{ij} \xi_j$, we have $\sum u_i \eta_i = \sum v_j^{(u)} \xi_j$ with
$v_j^{(u)} = \sum_{i=1}^{r} u_i c_{ij}$. Thus, the vectors $v^{(u)} = (v_1^{(u)}, \ldots, v_n^{(u)})$ are linear combinations, with
weights $u_1, \ldots, u_r$, of the first $r$ row vectors of $C$. Since the space spanned by
these row vectors is $V$, the result follows.

The set of linear functions $\sum v_i \xi_i$, $v \in V$, for which the interval (9.92) does
not cover the origin, that is, for which $v$ satisfies

$$\Big|\sum v_i \hat{\xi}_i\Big| > cS\sqrt{\frac{\mathrm{Var}(\sum v_i \hat{\xi}_i)}{\sigma^2}} , \tag{9.94}$$

is declared significantly different from 0 by the intervals (9.92). Thus (9.94) is
a rejection region at level $\alpha = 1 - \gamma$ of the hypothesis $H: \sum v_i \xi_i = 0$ for all
$v \in V$ in the sense that $H$ is rejected if and only if at least one $v \in V$ satisfies
(9.94). If $\Pi_\omega$ denotes the $(s - r)$-dimensional space of vectors $v \in \Pi_\Omega$ which are
orthogonal to $V$, then $H$ states that $\xi \in \Pi_\omega$, and the rejection region (9.94) is
in fact equivalent to the $F$-test of $H: \xi \in \Pi_\omega$ of Section 7.1. In canonical form,
this was seen in the sentence following (9.88).

To implement the intervals (9.92) in specific situations in which the corresponding
intervals for a single given function $\sum v_i \xi_i$ are known, it is only necessary to
designate the space $V$ and to obtain its dimension $r$, the constant $c$ then being
determined by (9.89).


Example 9.5.1 (All contrasts) Let $X_{ij}$ ($j = 1, \ldots, n_i$; $i = 1, \ldots, s$) be
independently distributed as $N(\xi_i, \sigma^2)$, and suppose $V$ is the space of all vectors
$v = (v_1, \ldots, v_n)$ satisfying

$$\sum v_i = 0 . \tag{9.95}$$

Any function $\sum v_i \xi_i$ with $v \in V$ is called a contrast among the $\xi_i$. The set of
contrasts includes in particular the differences $\bar{\xi}_+ - \bar{\xi}_-$ discussed in Example
9.4.1. The space $\Pi_\Omega$ is the set of all vectors $(\xi_1, \ldots, \xi_1; \xi_2, \ldots, \xi_2; \ldots; \xi_s, \ldots, \xi_s)$ and
has dimension $s$, while $V$ is the subspace of vectors in $\Pi_\Omega$ that are orthogonal to
$(1, \ldots, 1)$ and hence has dimension $r = s - 1$. It was seen in Section 7.3 that
$\hat{\xi}_i = X_{i\cdot}$, and if the vectors of $V$ are denoted by

$$\Big(\frac{w_1}{n_1}, \ldots, \frac{w_1}{n_1}; \frac{w_2}{n_2}, \ldots, \frac{w_2}{n_2}; \ldots; \frac{w_s}{n_s}, \ldots, \frac{w_s}{n_s}\Big) ,$$

the simultaneous confidence intervals (9.92) become (Problem 9.28)

$$\sum w_i X_{i\cdot} - cS\sqrt{\sum \frac{w_i^2}{n_i}} \le \sum w_i \xi_i \le \sum w_i X_{i\cdot} + cS\sqrt{\sum \frac{w_i^2}{n_i}} \tag{9.96}$$

for all $(w_1, \ldots, w_s)$ satisfying $\sum w_i = 0$, with $S^2 = \sum\sum (X_{ij} - X_{i\cdot})^2$.

In the present case the space $\Pi_\omega$ is the set of vectors with all coordinates
equal, so that the associated hypothesis is $H: \xi_1 = \cdots = \xi_s$. The rejection region
(9.94) is thus equivalent to that given by (7.19).

Instead of testing the overall homogeneity hypothesis $H$, we may be interested
in testing one or more subhypotheses suggested by the data. In the situation
corresponding to that of Example 9.4.1 (but with replications), for instance,
interest may focus on the hypotheses $H_1: \xi_{i_1} = \cdots = \xi_{i_k}$ and
$H_2: \xi_{j_1} = \cdots = \xi_{j_{s-k}}$. A level $\alpha$ simultaneous test of $H_1$ and $H_2$ is given by the rejection region


$$\frac{\sum^{(1)} n_i (X_{i\cdot} - X_{\cdot\cdot}^{(1)})^2 / (k - 1)}{S^2 / (n - s)} > C, \qquad \frac{\sum^{(2)} n_i (X_{i\cdot} - X_{\cdot\cdot}^{(2)})^2 / (s - k - 1)}{S^2 / (n - s)} > C,$$

where $\sum^{(1)}$, $\sum^{(2)}$, $X_{\cdot\cdot}^{(1)}$, $X_{\cdot\cdot}^{(2)}$ indicate that the summation or averaging extends
over the sets $(i_1, \ldots, i_k)$ and $(j_1, \ldots, j_{s-k})$ respectively, $S^2 = \sum\sum (X_{ij} - X_{i\cdot})^2$,
$\alpha = 1 - \gamma$, and the constant $C$ is given by (9.89) with $r = s$ and is therefore the
same as in (7.19), rather than being determined by the $F_{k-1, n-s}$ and $F_{s-k-1, n-s}$
distributions. The reason for this larger critical value is, of course, the fact that
$H_1$ and $H_2$ were suggested by the data. The present procedure is an example of
Gabriel's simultaneous test procedure mentioned in Section 9.3. ■
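
The contrast intervals (9.96) are simple to compute once c is available. The sketch below treats the special case s = 3, so that r = s - 1 = 2; in that case (9.89) can be solved in closed form, because $P(F_{2,m} \le x) = 1 - (1 + 2x/m)^{-m/2}$ and hence $c^2 = (1 - \gamma)^{-2/(n-s)} - 1$. The data and function names are hypothetical, and other values of r would need an $F$ quantile routine in place of the closed form.

```python
import math


def scheffe_c_r2(gamma, df_error):
    """c of (9.89) when r = 2, from the closed-form F_{2,m} distribution function."""
    return math.sqrt((1.0 - gamma) ** (-2.0 / df_error) - 1.0)


def contrast_interval(groups, w, gamma=0.95):
    """Interval (9.96) for the contrast sum w_i xi_i among s = 3 group means."""
    if len(groups) != 3:
        raise ValueError("the closed form used here assumes s = 3 (r = 2)")
    if abs(sum(w)) > 1e-12:
        raise ValueError("contrast weights must sum to zero")
    n = sum(len(g) for g in groups)
    s = len(groups)
    means = [sum(g) / len(g) for g in groups]
    # S^2 is the within-group sum of squares (not divided by n - s).
    S = math.sqrt(sum((x - m) ** 2 for g, m in zip(groups, means) for x in g))
    c = scheffe_c_r2(gamma, n - s)
    center = sum(wi * mi for wi, mi in zip(w, means))
    half = c * S * math.sqrt(sum(wi * wi / len(g) for wi, g in zip(w, groups)))
    return center - half, center + half


# Hypothetical data: three groups of three observations each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]

if __name__ == "__main__":
    print(contrast_interval(groups, [-1.0, 0.0, 1.0]))   # xi_3 - xi_1
    print(contrast_interval(groups, [-1.0, 1.0, 0.0]))   # xi_2 - xi_1
```

Because the same c serves every contrast, intervals for contrasts chosen after inspecting the data retain the simultaneous confidence coefficient γ.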


Example 9.5.2 (Two-way layout) As a second example, consider first the
additive model in the two-way classification of Section 7.4 or 7.5, and then the more
general interaction model of Section 7.5.

Suppose $X_{ij}$ are independent $N(\xi_{ij}, \sigma^2)$ ($i = 1, \ldots, a$; $j = 1, \ldots, b$), with $\xi_{ij}$
given by (7.20), and let $V$ be the space of all linear functions
$\sum w_i \alpha_i = \sum w_i (\xi_{i\cdot} - \xi_{\cdot\cdot})$. As was seen in Section 7.4, $s = a + b - 1$. To determine $r$, note that $V$ can
also be represented as $\sum_{i=1}^{a} w_i \xi_{i\cdot}$ with $\sum w_i = 0$ [Problem 9.27(i)], which shows
that $r = a - 1$. The least-squares estimators $\hat{\xi}_{ij}$ were found in Section 7.4 to be
$\hat{\xi}_{ij} = X_{i\cdot} + X_{\cdot j} - X_{\cdot\cdot}$, so that $\hat{\xi}_{i\cdot} = X_{i\cdot}$ and
$S^2 = \sum\sum (X_{ij} - X_{i\cdot} - X_{\cdot j} + X_{\cdot\cdot})^2$.
The simultaneous confidence intervals (9.92) therefore can be written as

$$\sum w_i X_{i\cdot} - cS\sqrt{\frac{\sum w_i^2}{b}} \le \sum w_i \xi_{i\cdot} \le \sum w_i X_{i\cdot} + cS\sqrt{\frac{\sum w_i^2}{b}} \quad \text{for all } w \text{ with } \sum_{i=1}^{a} w_i = 0 .$$

If there are $m$ observations in each cell, and the model is additive as before, the
only changes required are to replace $X_{i\cdot}$ by $X_{i\cdot\cdot}$, $S^2$ by
$\sum\sum\sum (X_{ijk} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot})^2$, and the expression under the square root by $\sum w_i^2 / bm$.





Let us now drop the assumption of additivity and consider the general linear
model $\xi_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$, with $\mu$ and the $\alpha$'s, $\beta$'s, and $\gamma$'s defined as in
Section 7.5. The dimension $s$ of $\Pi_\Omega$ is then $ab$, and the least squares estimators
of the parameters were seen in Section 7.5 to be

$$\hat\mu = X_{\cdot\cdot\cdot}, \quad \hat\alpha_i = X_{i\cdot\cdot} - X_{\cdot\cdot\cdot}, \quad \hat\beta_j = X_{\cdot j\cdot} - X_{\cdot\cdot\cdot}, \quad \hat\gamma_{ij} = X_{ij\cdot} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot}.$$

The simultaneous intervals for all $\sum w_i\alpha_i$, or for all $\sum w_i\xi_{i\cdot\cdot}$ with $\sum w_i = 0$, are
therefore unchanged except for the replacement of $S^2 = \sum\sum\sum (X_{ijk} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot})^2$
by $S^2 = \sum\sum\sum (X_{ijk} - X_{ij\cdot})^2$ and of $n - s = n - a - b + 1$ by $n - s = n - ab = (m - 1)ab$ in (9.89).

Analogously, one can obtain simultaneous confidence intervals for the totality
of linear functions $\sum\sum w_{ij}\gamma_{ij}$, or equivalently the set of functions $\sum\sum w_{ij}\xi_{ij\cdot}$ for the
totality of $w$'s satisfying $\sum_i w_{ij} = \sum_j w_{ij} = 0$ [Problem 9.27(ii), (iii)]. ■


Example 9.5.3 (Regression line) As a last example consider the problem of
obtaining confidence bands for a regression line, mentioned at the beginning of
the section. The problem was treated for a single value $t_0$ in Section 5.6 (with a
different notation) and in Section 7.6. The simultaneous confidence intervals in
the present case become

$$\hat\alpha + \hat\beta t - cS\left[\frac{1}{n} + \frac{(t - \bar t)^2}{\sum (t_i - \bar t)^2}\right]^{1/2} \le \alpha + \beta t \le \hat\alpha + \hat\beta t + cS\left[\frac{1}{n} + \frac{(t - \bar t)^2}{\sum (t_i - \bar t)^2}\right]^{1/2}, \tag{9.97}$$

where $\hat\alpha$ and $\hat\beta$ are given by (7.23),

$$S^2 = \sum (X_i - \hat\alpha - \hat\beta t_i)^2 = \sum (X_i - \bar X)^2 - \hat\beta^2 \sum (t_i - \bar t)^2,$$

and $c$ is determined by (9.89) with $r = s = 2$. This is the Working-Hotelling
confidence band for a regression line. ■
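The algebra behind (9.97) can be checked numerically. The following is a minimal sketch, assuming that (7.23) gives the usual least-squares estimates $\hat\beta = \sum (t_i - \bar t)(X_i - \bar X)/\sum (t_i - \bar t)^2$ and $\hat\alpha = \bar X - \hat\beta \bar t$. The constant $c$ is taken as an input rather than computed from (9.89), and all function names are illustrative only.

```python
import math


def least_squares_line(t, x):
    """Least-squares fit x ~ alpha + beta * t (assumed to match (7.23))."""
    n = len(t)
    t_bar = sum(t) / n
    x_bar = sum(x) / n
    stt = sum((ti - t_bar) ** 2 for ti in t)
    beta_hat = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x)) / stt
    alpha_hat = x_bar - beta_hat * t_bar
    return alpha_hat, beta_hat, t_bar, x_bar, stt


def residual_ss_direct(t, x):
    """S^2 as the sum of squared residuals about the fitted line."""
    a, b, _, _, _ = least_squares_line(t, x)
    return sum((xi - a - b * ti) ** 2 for ti, xi in zip(t, x))


def residual_ss_shortcut(t, x):
    """S^2 via the second expression in the text."""
    _, b, _, x_bar, stt = least_squares_line(t, x)
    return sum((xi - x_bar) ** 2 for xi in x) - b ** 2 * stt


def working_hotelling_band(t, x, t0, c):
    """Lower and upper limits of (9.97) at t0, for a user-supplied constant c."""
    n = len(t)
    a, b, t_bar, _, stt = least_squares_line(t, x)
    s = math.sqrt(residual_ss_direct(t, x))
    half_width = c * s * math.sqrt(1.0 / n + (t0 - t_bar) ** 2 / stt)
    centre = a + b * t0
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    t = [1.0, 2.0, 3.0, 4.0, 5.0]
    x = [1.1, 1.9, 3.2, 3.9, 5.1]
    print(residual_ss_direct(t, x), residual_ss_shortcut(t, x))
    print(working_hotelling_band(t, x, 3.0, c=1.0))
```

The band is narrowest at $t = \bar t$ and widens as $t$ moves away from the mean of the design points, which is what gives the Working-Hotelling region its hyperbolic shape.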


At the beginning of the section, the Scheffé intervals were derived as the only
confidence bands that are equivariant under the indicated groups. If the requirement
of equivariance (in particular under orthogonal transformations) is dropped,
other bounds exist which are narrower for certain sets of vectors u at the cost
of being wider for others [Problems 9.26(iii) and 9.32]. A general method that
gives special emphasis to a given subset is described by Richmond (1982). Some
optimality results not requiring equivariance but instead permitting bands which
are narrower for some values of t at the expense of being wider for others are
provided, among others, by Bohrer (1973), Cima and Hochberg (1976), Richmond
(1982), Naiman (1984a,b), and Piegorsch (1985a,b). If bounds are required only
for a subset, it may be possible that intervals exist at the prescribed confidence
level, which are uniformly narrower than the Scheffé intervals. This is the case
for example for the intervals (9.97) when t is restricted to a given finite interval.
For a discussion of this and related problems, and references to the literature, see
for example Wynn and Bloomfield (1971) and Wynn (1984).

9.6 Problems

Section 9.1 

Problem 9.1 Show that the Bonferroni procedure, while generally conservative, can
have FWER = $\alpha$ by exhibiting a joint distribution for $(\hat p_1,\dots,\hat p_s)$ satisfying
(9.4) such that $P\{\min_i \hat p_i \le \alpha/s\} = \alpha$.


Problem 9.2 (i) Under the assumptions of Theorem 9.1.1, suppose also that
the p-values are mutually independent. Then the procedure which rejects any
$H_i$ for which $\hat p_i < c(\alpha, s) = 1 - (1 - \alpha)^{1/s}$ controls the FWER.

(ii) Compare $\alpha/s$ with $c(\alpha, s)$ and show

$$\lim_{s\to\infty} \frac{c(\alpha, s)}{\alpha/s} = \frac{-\log(1 - \alpha)}{\alpha}.$$

For $\alpha = .05$, this limiting value to 3 decimals is 1.026, so the increase in cutoff
value is not substantial.
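The numerical claim in part (ii) is easy to reproduce. The sketch below (function names are illustrative, not from the text) evaluates the ratio $c(\alpha, s)/(\alpha/s)$ for increasing $s$ and compares it with the limit $-\log(1-\alpha)/\alpha$; `expm1` and `log1p` are used only to keep the computation accurate when $s$ is large.

```python
import math


def bonferroni_cutoff(alpha, s):
    """Bonferroni cutoff alpha / s."""
    return alpha / s


def independence_cutoff(alpha, s):
    """c(alpha, s) = 1 - (1 - alpha)^(1/s), computed stably for large s."""
    return -math.expm1(math.log1p(-alpha) / s)


def limiting_ratio(alpha):
    """Limit of c(alpha, s) / (alpha / s) as s tends to infinity."""
    return -math.log1p(-alpha) / alpha


if __name__ == "__main__":
    alpha = 0.05
    for s in (1, 2, 10, 100, 10_000):
        ratio = independence_cutoff(alpha, s) / bonferroni_cutoff(alpha, s)
        print(s, ratio)
    print("limit:", round(limiting_ratio(alpha), 3))
```

The ratio equals 1 at $s = 1$ and increases monotonically towards the limit, so for $\alpha = .05$ the gain over Bonferroni never exceeds about 2.6 percent.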


Problem 9.3 Show that, under the assumptions of Theorem 9.1.2, it is not
possible to increase any of the critical values $\alpha_i = \alpha/(s - i + 1)$ in the Holm
procedure (9.6) without violating the FWER.


Problem 9.4 Under the assumptions of Theorem 9.1.2 and independence of the
p-values, the critical values $\alpha/(s - i + 1)$ can be increased to $1 - (1 - \alpha)^{1/(s-i+1)}$.
For any $i$, calculate the limiting value of the ratio of these critical values, as
$s \to \infty$.


Problem 9.5 In Example 9.1.4, verify that the stepdown procedure based on
the maximum of $X_j/\sqrt{\sigma_{j,j}}$ improves upon the Holm procedure. By Theorem
9.1.3, the procedure has FWER $\le \alpha$. Compare the two procedures in the case
$\sigma_{i,i} = 1$, $\sigma_{i,j} = \rho$ if $i \ne j$; consider $\rho = 0$ and $\rho \to \pm 1$.


Problem 9.6 Suppose $H_i$ specifies that the unknown probability distribution $P$ belongs to a
subset $\omega_i$ of the parameter space, for $i = 1,\dots,s$. For any $K \subset \{1,\dots,s\}$, let $H_K$
be the intersection hypothesis $P \in \bigcap_{j\in K}\omega_j$. Suppose $\phi_K$ is level $\alpha$ for testing
$H_K$. Consider the multiple testing procedure that rejects $H_i$ if $\phi_K$ rejects $H_K$
whenever $i \in K$. Show that the FWER $\le \alpha$. [This method of constructing tests that
control the FWER is called the closure method of Marcus, Peritz and Gabriel
(1976).]


Problem 9.7 As in Procedure 9.1.1, suppose that a test of the individual
hypothesis $H_j$ is based on a test statistic $T_{n,j}$, with large values indicating evidence
against $H_j$. Assume $\bigcap_{j=1}^s \omega_j$ is not empty. For any subset $K$ of $\{1,\dots,s\}$,
let $c_{n,K}(\alpha, P)$ denote an $\alpha$-quantile of the distribution of $\max_{j\in K} T_{n,j}$ under $P$.
Concretely,

$$c_{n,K}(\alpha, P) = \inf\{x : P\{\max_{j\in K} T_{n,j} \le x\} \ge \alpha\}. \tag{9.98}$$

For testing the intersection hypothesis $H_K$, it is only required to approximate a
critical value for $P \in \bigcap_{j\in K}\omega_j$. Because there may be many such $P$, we define

$$c_{n,K}(1 - \alpha) = \sup\{c_{n,K}(1 - \alpha, P) : P \in \bigcap_{j\in K}\omega_j\}. \tag{9.99}$$

(i) In Procedure 9.1.1, show that the choice $\hat c_{n,K}(1 - \alpha) = c_{n,K}(1 - \alpha)$ controls
the FWER, as long as (9.9) holds.

(ii) Further assume that for every subset $K \subset \{1,\dots,s\}$, there exists a
distribution $P_K$ which satisfies

$$c_{n,K}(1 - \alpha, P) \le c_{n,K}(1 - \alpha, P_K) \tag{9.100}$$

for all $P$ such that $I(P) \supset K$. Such a $P_K$ may be referred to as being least favorable
among distributions $P$ such that $P \in \bigcap_{j\in K}\omega_j$. (For example, if $H_j$ corresponds
to a parameter $\theta_j \le 0$, then intuition suggests a least favorable configuration
should correspond to $\theta_j = 0$.) In addition, assume the subset pivotality condition
of Westfall and Young (1993); that is, assume there exists a $P_0$ with $I(P_0) = \{1,\dots,s\}$
such that the joint distribution of $\{T_{n,j} : j \in I(P_K)\}$ under $P_K$ is
the same as the distribution of $\{T_{n,j} : j \in I(P_K)\}$ under $P_0$. This condition
says the (joint) distribution of the test statistics used for testing the hypotheses
$H_j$, $j \in I(P_K)$, is unaffected by the truth or falsehood of the remaining hypotheses
(and therefore we assume all hypotheses are true by calculating the distribution
of the maximum under $P_0$). Show we can use $c_{n,K}(1 - \alpha, P_0)$ for $\hat c_{n,K}(1 - \alpha)$.

(iii) Further assume the distribution of $(T_{n,1},\dots,T_{n,s})$ under $P_0$ is invariant under
permutations (or exchangeable). Then the critical values $\hat c_{n,K}(1 - \alpha)$ can be
chosen to depend only on $|K|$.

Problem 9.8 Rather than finding multiple tests that control the FWER,
consider the k-FWER, the probability of k or more false rejections (that is, of
rejecting k or more true hypotheses). For a given k, if there are s hypotheses,
consider the procedure that rejects any hypothesis whose p-value is $\le k\alpha/s$. Show
that the resulting procedure controls the k-FWER. [Additional stepdown procedures
that control the number of false rejections, as well as the probability that the
proportion of false rejections exceeds
a given bound, are obtained in Lehmann and Romano (2005).]
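As an illustration only (not a proof), the k-FWER of this procedure can be computed exactly in one special case: all $s$ hypotheses true and the p-values independent and uniform on (0, 1). Both conditions are assumptions made for this sketch and are not required by the problem; the function name is hypothetical. The number of rejections is then binomial with success probability $k\alpha/s$.

```python
from math import comb


def k_fwer_all_true_independent(s, k, alpha):
    """P(at least k of s independent U(0,1) p-values are <= k * alpha / s).

    Assumes every hypothesis is true, so each rejection is a false rejection.
    """
    q = min(1.0, k * alpha / s)
    return sum(comb(s, j) * q ** j * (1.0 - q) ** (s - j) for j in range(k, s + 1))


if __name__ == "__main__":
    alpha = 0.05
    for s in (5, 20, 100):
        for k in (1, 2, 3):
            print(s, k, k_fwer_all_true_independent(s, k, alpha))
```

In this special case the expected number of false rejections is $k\alpha$, so Markov's inequality already bounds the probability of $k$ or more of them by $\alpha$; the exact values printed above are typically much smaller.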

Problem 9.9 In general, show that FDR $\le$ FWER, and equality holds when
all hypotheses are true. Therefore, control of the FWER at level $\alpha$ implies control
of the FDR.


Section 9.2 

Problem 9.10 Suppose $(X_1,\dots,X_k)^T$ has a multivariate c.d.f. $F(\cdot)$. For $\theta \in \mathbb{R}^k$,
let $F_\theta(x) = F(x - \theta)$ define a multivariate location family. Show that (9.15)
is satisfied for this family. (In particular, it holds if $F$ is any multivariate normal
distribution.)

Problem 9.11 Prove Lemma 9.2.2.

Problem 9.12 We have suppressed the dependence of the critical constants
$C_1,\dots,C_s$ in the definition of the stepdown procedure $D$, and now more
accurately call them $C_{s,1},\dots,C_{s,s}$. Argue that, for fixed $s$, $C_{s,j}$ is nonincreasing in
$j$ and only depends on $s - j$.

Problem 9.13 Under the assumptions of Theorem 9.2.1, suppose there exists 
another monotone rule E that strongly controls the FWER, and such that 

— ^{ e o,o} f° r $ £ U'o.O 5 (9.101) 

with strict inequality for some 9 £ Wo,o- Argue that the < in (9.101) is an equality, 
and hence $e_{0,0}\,\Delta\,d_{0,0}$ has Lebesgue measure 0, where $A\,\Delta\,B$ denotes the symmetric
difference between sets $A$ and $B$. A similar result for the region $d_{1,1}$ can be made
as well.

Problem 9.14 In general, the optimality results of Section 9.2 require the
procedures to be monotone. To see why this is required, consider 9.2.2 (i). Show the
procedure $E$ to be inadmissible. Hint: One can always add large negative values
of $T_1$ and $T_2$ to the region $u_{1,1}$ without violating the FWER.

Problem 9.15 Prove part (i) of Theorem 9.2.3. 

Problem 9.16 In general, show C s = C J. In the case s = 2, show (9.27). 

Section 9.3 

Problem 9.17 Show that 



Problem 9.18 (i) For the validity of Lemma 9.3.1 it is only required that the
probability of rejecting homogeneity of any set containing $\{\mu_{i_1},\dots,\mu_{i_{v_1}}\}$
as a proper subset tends to 1 as the distance between the different groups
(9.48) all $\to \infty$, with the analogous condition holding for $H_2',\dots,H_r'$.

(ii) The condition of part (i) is satisfied for example if homogeneity of a set $S$
is rejected for large values of $\sum |X_{i\cdot} - X_{\cdot\cdot}|$, where the sum extends over
the subscripts $i$ for which $\mu_i \in S$.

Problem 9.19 In Lemma 9.3.2, show that $\alpha_{s-1} = \alpha$ is necessary for
admissibility.

Problem 9.20 Prove Lemma 9.3.3 when s is odd. 

Problem 9.21 Show that the Tukey levels (vi) satisfy (9.54) when s is even but
not when s is odd.

Problem 9.22 The Tukey T-method leads to the simultaneous confidence
intervals

$$|(X_{j\cdot} - X_{i\cdot}) - (\mu_j - \mu_i)| \le \frac{CS}{\sqrt{sn(n-1)}} \quad \text{for all } i, j. \tag{9.102}$$

[The probability of (9.102) is independent of the $\mu$'s and hence equal to $1 - \alpha_s$.]


Section 9.4 

Problem 9.23 (i) A function $L$ satisfies the first equation of (9.65) for all $u$,
$x$, and orthogonal transformations $Q$ if and only if it depends on $u$ and $x$
only through $u'x$, $x'x$, and $u'u$.

(ii) A function $L$ is equivariant under $G_2$ if and only if it satisfies (9.67).

Problem 9.24 (i) For the confidence sets (9.73), equivariance under $G_1$ and
$G_2$ reduces to (9.74) and (9.75) respectively.

(ii) For fixed $(y_1,\dots,y_r)$, the statements $\sum u_iy_i \in A$ hold for all $(u_1,\dots,u_r)$
with $\sum u_i^2 = 1$ if and only if $A$ contains the interval
$I(y) = \left[-\sqrt{\sum y_i^2},\ +\sqrt{\sum y_i^2}\right]$.

(iii) Show that the statement following (9.77) ceases to hold when $r = 1$.

Problem 9.25 Let $X_i$ $(i = 1,\dots,r)$ be independent $N(\xi_i, 1)$.

(i) The only simultaneous confidence intervals equivariant under $G_0$ are those
given by (9.83).

(ii) The inequalities (9.83) and (9.85) are equivalent.

(iii) Compared with the Scheffé intervals (9.72), the intervals (9.85) for $\sum u_j\xi_j$
are shorter when $\sum u_j\xi_j = \xi_i$ and longer when $u_1 = \cdots = u_r$.

[(ii): For a fixed $u = (u_1,\dots,u_r)$, $\sum u_iy_i$ is maximized subject to $|y_i| \le \Delta$ for all
$i$, by $y_i = \Delta$ when $u_i > 0$ and $y_i = -\Delta$ when $u_i < 0$.]

Section 9.5 

Problem 9.26 (i) The confidence intervals $L(u; y, S) = \sum u_iy_i - c(S)$ are
equivariant under $G_3$ if and only if $L(u; by, bS) = bL(u; y, S)$ for all $b > 0$.

(ii) The most general confidence sets (9.90) which are equivariant under $G_1$,
$G_2$, and $G_3$ are of the form (9.91).

Problem 9.27 (i) In Example 9.5.2, the set of linear functions $\sum w_i\alpha_i = \sum w_i(\xi_{i\cdot} - \xi_{\cdot\cdot})$
for all $w$ can also be represented as the set of functions
$\sum w_i\xi_{i\cdot}$ for all $w$ satisfying $\sum w_i = 0$.

(ii) The set of linear functions $\sum\sum w_{ij}\gamma_{ij} = \sum\sum w_{ij}(\xi_{ij\cdot} - \xi_{i\cdot\cdot} - \xi_{\cdot j\cdot} + \xi_{\cdot\cdot\cdot})$
for all $w$ is equivalent to the set $\sum\sum w_{ij}\xi_{ij\cdot}$ for all $w$ satisfying $\sum_i w_{ij} = \sum_j w_{ij} = 0$.

(iii) Determine the simultaneous confidence intervals (9.92) for the set of linear
functions of part (ii).

Problem 9.28 (i) In Example 9.5.1, the simultaneous confidence intervals
(9.92) reduce to (9.96).

(ii) What change is needed in the confidence intervals of Example 9.5.1 if the $v$'s
are not required to satisfy (9.95), i.e., if simultaneous confidence intervals
are desired for all linear functions $\sum v_i\xi_i$ instead of all contrasts? Make a
table showing the effect of this change for $s = 2, 3, 4, 5$; $n_i = n = 3, 5, 10$.

Problem 9.29 Tukey's T-Method. Let $X_i$ $(i = 1,\dots,r)$ be independent $N(\xi_i, 1)$,
and consider simultaneous confidence intervals

$$L[(i,j); x] \le \xi_j - \xi_i \le M[(i,j); x] \quad \text{for all } i \ne j. \tag{9.103}$$

The problem of determining such confidence intervals remains invariant under the
group $G_0'$ of all permutations of the $X$'s and under the group $G_2$ of translations
$gx = x + a$.

(i) In analogy with (9.64), attention can be restricted to confidence bounds
satisfying

$$L[(i,j); x] = -M[(j,i); x]. \tag{9.104}$$

(ii) The only simultaneous confidence intervals satisfying (9.104) and equivariant
under $G_0'$ and $G_2$ are those of the form

$$S(x) = \{\xi : x_j - x_i - \Delta < \xi_j - \xi_i < x_j - x_i + \Delta \text{ for all } i \ne j\}. \tag{9.105}$$

(iii) The constant $\Delta$ for which (9.105) has probability $\gamma$ is determined by

$$P_0\{\max |X_j - X_i| < \Delta\} = P_0\{X_{(n)} - X_{(1)} < \Delta\} = \gamma, \tag{9.106}$$

where the probability $P_0$ is calculated under the assumption that $\xi_1 = \cdots = \xi_r$.

Problem 9.30 In the preceding problem consider arbitrary contrasts $\sum c_i\xi_i$
with $\sum c_i = 0$. The event

$$|(X_j - X_i) - (\xi_j - \xi_i)| < \Delta \quad \text{for all } i \ne j \tag{9.107}$$

is equivalent to the event

$$\left|\sum c_iX_i - \sum c_i\xi_i\right| < \frac{\Delta}{2}\sum |c_i| \quad \text{for all } c \text{ with } \sum c_i = 0, \tag{9.108}$$

which therefore also has probability $\gamma$. This shows how to extend the Tukey
intervals for all pairs to all contrasts.

[That (9.108) implies (9.107) is obvious. To see that (9.107) implies (9.108), let
$y_i = x_i - \xi_i$ and maximize $|\sum c_iy_i|$ subject to $|y_j - y_i| < \Delta$ for all $i$ and $j$. Let
$P$ and $N$ denote the sets $\{i : c_i > 0\}$ and $\{i : c_i < 0\}$, so that

$$\sum c_iy_i = \sum_{i\in P} c_iy_i - \sum_{i\in N} |c_i|\,y_i.$$

Then for fixed $c$, the sum $\sum c_iy_i$ is maximized by maximizing the $y_i$'s for $i \in P$
and minimizing those for $i \in N$. Since $|y_j - y_i| < \Delta$, it is seen that $\sum c_iy_i$ is
maximized by $y_i = \Delta/2$ for $i \in P$, $y_i = -\Delta/2$ for $i \in N$. The minimization of
$\sum c_iy_i$ is handled analogously.]
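The bound in the hint can be sanity-checked numerically. The sketch below, with illustrative names throughout, evaluates a contrast at the extremal configuration $y_i = \pm\Delta/2$ and at randomly drawn points whose range does not exceed $\Delta$. It works with the closed constraint $|y_j - y_i| \le \Delta$, so the bound $(\Delta/2)\sum |c_i|$ is attained on the boundary; under the strict inequality of (9.107) it is a supremum rather than a maximum.

```python
import random


def contrast(c, y):
    """The linear combination sum c_i * y_i."""
    return sum(ci * yi for ci, yi in zip(c, y))


def contrast_bound(c, delta):
    """The bound (delta / 2) * sum |c_i| appearing in (9.108)."""
    return 0.5 * delta * sum(abs(ci) for ci in c)


def extremal_y(c, delta):
    """y_i = +delta/2 where c_i > 0, -delta/2 where c_i < 0, 0 where c_i = 0."""
    return [delta / 2 if ci > 0 else -delta / 2 if ci < 0 else 0.0 for ci in c]


def random_feasible_y(r, delta, rng):
    """A random point whose coordinates all lie in an interval of length delta."""
    base = rng.uniform(-10.0, 10.0)
    return [base + rng.uniform(0.0, delta) for _ in range(r)]


if __name__ == "__main__":
    c = [2.0, -1.0, -1.0, 0.5, -0.5]  # coefficients sum to zero
    delta = 1.3
    rng = random.Random(0)
    worst = max(abs(contrast(c, random_feasible_y(len(c), delta, rng)))
                for _ in range(10_000))
    print(contrast(c, extremal_y(c, delta)), contrast_bound(c, delta), worst)
```

Because the coefficients sum to zero, the contrast is unchanged when a constant is added to every $y_i$, which is why only the differences $y_j - y_i$ need to be constrained.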


Problem 9.31 (i) Let $X_{ij}$ $(j = 1,\dots,n;\ i = 1,\dots,s)$ be independent
$N(\xi_i, \sigma^2)$, $\sigma^2$ unknown. Then the problem of obtaining simultaneous
confidence intervals for all differences $\xi_j - \xi_i$ is invariant under $G_0'$, $G_2$, and
the scale changes $G_3$.

(ii) The only equivariant confidence bounds based on the sufficient statistics
$X_{i\cdot}$ and $S^2 = \sum\sum (X_{ij} - X_{i\cdot})^2$ and satisfying the condition corresponding
to (9.104) are those given by

$$S(x) = \left\{\xi : x_{j\cdot} - x_{i\cdot} - \frac{\Delta S}{\sqrt{n-s}} \le \xi_j - \xi_i \le x_{j\cdot} - x_{i\cdot} + \frac{\Delta S}{\sqrt{n-s}} \text{ for all } i \ne j\right\} \tag{9.109}$$

with $\Delta$ determined by the null distribution of the Studentized range

$$P_0\left\{\frac{\max |X_{j\cdot} - X_{i\cdot}|}{S/\sqrt{n-s}} < \Delta\right\} = \gamma. \tag{9.110}$$

(iii) Extend the results of Problem 9.30 to the present situation.


Problem 9.32 Construct an example [i.e., choose values $n_1 = \cdots = n_s = n$
and a particular contrast $(c_1,\dots,c_s)$] for which the Tukey confidence intervals
(9.108) are shorter than the Scheffé intervals (9.96), and an example in which the
situation is reversed.


Problem 9.33 Dunnett's method. Let $X_{0j}$ $(j = 1,\dots,m)$ and $X_{ik}$ $(i = 1,\dots,s;\ k = 1,\dots,n)$
represent measurements on a standard and $s$ competing
new treatments, and suppose the $X$'s are independently distributed as $N(\xi_0, \sigma^2)$
and $N(\xi_i, \sigma^2)$ respectively. Generalize Problems 9.29 and 9.31 to the problem
of obtaining simultaneous confidence intervals for the $s$ differences $\xi_i - \xi_0$ $(i = 1,\dots,s)$.

Problem 9.34 In generalization of Problem 9.30, show how to extend the
Dunnett intervals of Problem 9.33 to the set of all contrasts.

[Use the fact that the event $|y_i - y_0| \le \Delta$ for $i = 1,\dots,s$ is equivalent to the
event $|\sum_{i=0}^s c_iy_i| \le \Delta\sum_{i=1}^s |c_i|$ for all $(c_0,\dots,c_s)$ satisfying $\sum_{i=0}^s c_i = 0$.]

Note. As is pointed out in Problems 9.26(iii) and 9.32, the intervals resulting
from the extension of the Tukey (and Dunnett) methods to all contrasts are
shorter than the Scheffé intervals for the differences for which these methods were
designed and for contrasts close to them, and longer for some other contrasts.
For details and generalizations, see for example Miller (1981), Richmond (1982),
and Shaffer (1977a).

Problem 9.35 In the regression model of Problem 7.8, generalize the confidence
bands of Example 9.5.3 to the regression surfaces

(i) $h_1(e_1,\dots,e_s) = \sum_{j=1}^s e_j\beta_j$;

(ii) $h_2(e_2,\dots,e_s) = \beta_1 + \sum_{j=2}^s e_j\beta_j$.


9.7 Notes

Many of the basic ideas for making multiple inferences were pioneered by Tukey 
(1953); see Tukey (1991), Braun (1994), and Shaffer (1995). See Duncan (1955) 
for an exposition of the ideas of one of the early workers in the area of multiple 
comparisons. 

Comprehensive accounts of the theory and methodology of multiple testing
can be found in Hochberg and Tamhane (1987), Westfall and Young (1993),
Hsu (1996), and Dudoit, Shaffer and Boldrick (2003). Some recent work on
stepwise procedures includes Troendle (1995), Finner and Roters (1998, 2002),
and Romano and Wolf (2004). Confidence sets based on multiple tests are studied
in Hayter and Hsu (1994), Miwa and Hayter (1999) and Holm (1999).

The first simultaneous confidence intervals (for a regression line) were obtained
by Working and Hotelling (1929). Scheffé's approach was generalized in Roy and
Bose (1953). The optimal property of the Scheffé intervals presented in Section
9.4 is a special case of results of Wijsman (1979, 1980). A review of the literature
on the relationship of tests and confidence sets for a parameter vector with the
associated simultaneous confidence intervals for functions of its components can
be found in Kanoh and Kusunoki (1984). Some alternative methods to construct
confidence bands in regression contexts are given in Faraway and Sun (1995) and
Spurrier (1999).


Conditional Inference


10.1 Mixtures of Experiments 

The present chapter has a somewhat different character from the preceding ones. 
It is concerned with problems regarding the proper choice and interpretation of 
tests and confidence procedures, problems which—despite a large literature— 
have not found a definitive solution. The discussion will thus be more tentative 
than in earlier chapters, and will focus on conceptual aspects more than on 
technical ones. 

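Consider the situation in which either the experiment $\mathcal{E}$ of observing a random
quantity $X$ with density $p_\theta$ (with respect to $\mu$) or the experiment $\mathcal{F}$ of observing
an $X$ with density $q_\theta$ (with respect to $\nu$) is performed with probability $p$ and
$q = 1 - p$ respectively. On the basis of $X$, and knowledge of which of the two
experiments was performed, it is desired to test $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1$.
For the sake of convenience it will be assumed that the two experiments have the
same sample space and the same $\sigma$-field of measurable sets. The sample space of
the overall experiment consists of the union of the sets

$$\mathcal{X}_0 = \{(I, x) : I = 0,\ x \in \mathcal{X}\} \quad\text{and}\quad \mathcal{X}_1 = \{(I, x) : I = 1,\ x \in \mathcal{X}\}$$

where $I$ is 0 or 1 as $\mathcal{E}$ or $\mathcal{F}$ is performed.

A level-$\alpha$ test of $H_0$ is defined by its critical function

$$\phi_i(x) = \phi(i, x)$$

and must satisfy

$$pE_0[\phi_0(X) \mid \mathcal{E}] + qE_0[\phi_1(X) \mid \mathcal{F}] = p\int \phi_0\,p_{\theta_0}\,d\mu + q\int \phi_1\,q_{\theta_0}\,d\nu \le \alpha. \tag{10.1}$$

Suppose that $p$ is unknown, so that $H_0$ is composite. Then a level-$\alpha$ test of $H_0$
satisfies (10.1) for all $0 < p < 1$, and must therefore satisfy

$$\alpha_0 = \int \phi_0\,p_{\theta_0}\,d\mu \le \alpha \quad\text{and}\quad \alpha_1 = \int \phi_1\,q_{\theta_0}\,d\nu \le \alpha. \tag{10.2}$$

As a result, a UMP test against $H_1$ exists and is given by

$$\phi_0(x) = \begin{cases} 1 \\ \gamma_0 \\ 0 \end{cases} \text{if } \frac{p_{\theta_1}(x)}{p_{\theta_0}(x)} \begin{matrix} > \\ = \\ < \end{matrix}\ c_0, \qquad \phi_1(x) = \begin{cases} 1 \\ \gamma_1 \\ 0 \end{cases} \text{if } \frac{q_{\theta_1}(x)}{q_{\theta_0}(x)} \begin{matrix} > \\ = \\ < \end{matrix}\ c_1, \tag{10.3}$$

where the $c_i$ and $\gamma_i$ are determined by

$$E_{\theta_0}[\phi_0(X) \mid \mathcal{E}] = E_{\theta_0}[\phi_1(X) \mid \mathcal{F}] = \alpha. \tag{10.4}$$

The power of this test against $H_1$ is

$$\beta(p) = p\beta_0 + q\beta_1 \tag{10.5}$$

with

$$\beta_0 = E_{\theta_1}[\phi_0(X) \mid \mathcal{E}], \qquad \beta_1 = E_{\theta_1}[\phi_1(X) \mid \mathcal{F}]. \tag{10.6}$$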


The situation is analogous to that of Section 4.4 and, as was discussed there, it
may be more appropriate to consider the conditional power $\beta_i$ when $I = i$, since
this is the power pertaining to the experiment that has been performed. As in
the earlier case, the conditional power $\beta_I$ can also be interpreted as an estimate
of the unknown $\beta(p)$, which is unbiased, since

$$E(\beta_I) = p\beta_0 + q\beta_1 = \beta(p).$$


So far, the probability $p$ of performing experiment $\mathcal{E}$ has been assumed to be
unknown. Suppose instead that the value of $p$ is known, say $p = \frac{1}{2}$. The hypothesis
$H$ can be tested at level $\alpha$ by means of (10.3) as before, but the power of the
test is now known to be $\frac{1}{2}(\beta_0 + \beta_1)$. Suppose that $\beta_0 = .3$, $\beta_1 = .9$, so that at the
start of the experiment the power is $\frac{1}{2}(.3 + .9) = .6$. Now a fair coin is tossed to
decide whether to perform $\mathcal{E}$ (in case of heads) or $\mathcal{F}$ (in case of tails). If the coin
shows heads, should the power be reassessed and scaled down to .3?

Let us postpone the answer and first consider another change resulting from
the knowledge of $p$. A level-$\alpha$ test of $H$ now no longer needs to satisfy (10.2) but
only the weaker condition

$$\frac{1}{2}\left[\int \phi_0\,p_{\theta_0}\,d\mu + \int \phi_1\,q_{\theta_0}\,d\nu\right] \le \alpha. \tag{10.7}$$

The most powerful test against $K$ is then again given by (10.3), but now with
$c_0 = c_1 = c$ and $\gamma_0 = \gamma_1 = \gamma$ determined by (Problem 10.3)

$$\tfrac{1}{2}(\alpha_0 + \alpha_1) = \alpha, \tag{10.8}$$

where

$$\alpha_0 = E_{\theta_0}[\phi_0(X) \mid \mathcal{E}], \qquad \alpha_1 = E_{\theta_0}[\phi_1(X) \mid \mathcal{F}]. \tag{10.9}$$

As an illustration of the change, suppose that experiment $\mathcal{F}$ is reasonably
informative, say that the power $\beta_1$ given by (10.6) is .8, but that $\mathcal{E}$ has little ability
to distinguish between $p_{\theta_0}$ and $p_{\theta_1}$. Then it will typically not pay to put much of
the rejection probability into $\alpha_0$; if $\beta_0$ [given by (10.6)] is sufficiently small, the
best choice of $\alpha_0$ and $\alpha_1$ satisfying (10.8) is approximately $\alpha_0 \approx 0$, $\alpha_1 \approx 2\alpha$. The
situation will be reversed if $\mathcal{F}$ is so informative that $\mathcal{F}$ can attain power close
to 1 with an $\alpha_1$ much smaller than $\alpha/2$.
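The shift of rejection probability described above can be seen in a small numerical sketch. The normal models used here are an assumption made purely for illustration and do not come from the text: under $\mathcal{E}$ the observation is $N(\mu_E\theta, 1)$ with $\mu_E = 0.05$ (nearly uninformative), under $\mathcal{F}$ it is $N(\mu_F\theta, 1)$ with $\mu_F = 2.5$, and $\theta_0 = 0$ is tested against $\theta_1 = 1$ with $p = \frac{1}{2}$. For a normal mean $\mu$ the likelihood ratio $\exp(\mu x - \mu^2/2)$ exceeds $k$ exactly when $x > (\log k)/\mu + \mu/2$, so a common $k$ translates into a different cutoff in each experiment. All function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cutoff(mu, log_k):
    """x-cutoff at which the likelihood ratio exp(mu*x - mu^2/2) equals k."""
    return log_k / mu + mu / 2.0


def size_and_power(mu, cut):
    """Rejection probability of {X > cut} under mean 0 and under mean mu."""
    size = 1.0 - STD_NORMAL.cdf(cut)
    power = 1.0 - STD_NORMAL.cdf(cut - mu)
    return size, power


def unconditional_test(mu_e, mu_f, alpha, p=0.5):
    """Common log k with p*alpha_0 + (1-p)*alpha_1 = alpha, found by bisection."""
    def average_size(log_k):
        a0, _ = size_and_power(mu_e, cutoff(mu_e, log_k))
        a1, _ = size_and_power(mu_f, cutoff(mu_f, log_k))
        return p * a0 + (1.0 - p) * a1

    lo, hi = -50.0, 50.0  # average size is decreasing in log k
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if average_size(mid) > alpha:
            lo = mid
        else:
            hi = mid
    log_k = (lo + hi) / 2.0
    return (size_and_power(mu_e, cutoff(mu_e, log_k)),
            size_and_power(mu_f, cutoff(mu_f, log_k)))


def conditional_test(mu_e, mu_f, alpha):
    """Level alpha within each experiment separately, as in (10.4)."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    return size_and_power(mu_e, z), size_and_power(mu_f, z)


if __name__ == "__main__":
    (a0, b0), (a1, b1) = unconditional_test(0.05, 2.5, 0.05)
    (ca0, cb0), (ca1, cb1) = conditional_test(0.05, 2.5, 0.05)
    print("unconditional: alpha_0, alpha_1 =", a0, a1, " power =", 0.5 * (b0 + b1))
    print("conditional:   alpha_0, alpha_1 =", ca0, ca1, " power =", 0.5 * (cb0 + cb1))
```

With these illustrative values almost all of the rejection probability moves to $\mathcal{F}$, and the average power of the unconditional test exceeds that of the conditional one, at the price of a conditional level near $2\alpha$ whenever $\mathcal{F}$ is the experiment actually performed.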

When p is known, there are therefore two issues. Should the procedure be 
chosen which is best on the average over both experiments, or should the best 
conditional procedure be preferred; and, for a given test or confidence procedure, 
should probabilities such as level, power, and confidence coefficient be calculated 
conditionally, given the experiment that has been selected, or unconditionally? 
The underlying question is of course the same: Is a conditional or unconditional 
point of view more appropriate? 

The answer cannot be found within the model but depends on the context. If 
the overall experiment will be performed many times, for example in an industrial 
or agricultural setting, the average performance may be the principal feature of 
interest, and an unconditional approach suitable. However, if repetitions refer to 
different clients, or are potential rather than actual, interest will focus on the
particular event at hand, and conditioning seems more appropriate. Unfortunately,
as will be seen in later sections, it is then often not clear how the conditioning 
events should be chosen. 

The difference between the conditional and the unconditional approach tends
to be most striking, and a choice between them therefore most pressing, when
the two experiments $\mathcal{E}$ and $\mathcal{F}$ differ sharply in the amount of information they
contain, if for example the difference $|\beta_1 - \beta_0|$ in (10.6) is large. To illustrate an
extreme situation in which this is not the case, suppose that $\mathcal{E}$ and $\mathcal{F}$ consist
in observing $X$ with distribution $N(\theta, 1)$ and $N(-\theta, 1)$ respectively, that one of
them is selected with known probabilities $p$ and $q$ respectively, and that it is
desired to test $H : \theta = 0$ against $K : \theta > 0$. Here $\mathcal{E}$ and $\mathcal{F}$ contain exactly the
same amount of information about $\theta$. The unconditional most powerful level-$\alpha$
test of $H$ against $\theta_1 > 0$ is seen to reject (Problem 10.5) when $X > c$ if $\mathcal{E}$ is
performed, and when $X < -c$ if $\mathcal{F}$ is performed, where $P_0(X > c) = \alpha$. The test
is UMP against $\theta > 0$, and happens to coincide with the UMP conditional test.

The issues raised here extend in an obvious way to mixtures of more than
two experiments. As an illustration of a mixture over a continuum, consider
a regression situation. Suppose that $X_1,\dots,X_n$ are independent, and that the
conditional density of $X_i$ given $t_i$ is

$$\frac{1}{\sigma} f\left(\frac{x_i - \alpha - \beta t_i}{\sigma}\right).$$

The $t_i$ themselves are obtained with error. They may for example be independently
normally distributed with mean $c_i$ and known variance $\tau^2$, where the $c_i$ are
the intended values of the $t_i$. Then it will again often be the case that the most
appropriate inference concerning $\alpha$, $\beta$, and $\sigma$ is conditional on the observed values
of the $t$'s (which represent the experiment actually being performed). Whether
this is the case will, as before, depend on the context.

The argument for conditioning also applies when the probabilities of performing
the various experiments are unknown, say depend on a parameter $\vartheta$, provided
$\vartheta$ is unrelated to $\theta$, so that which experiment is chosen provides no information
concerning $\theta$. A more precise statement of this generalization is given at the end
of the next section.


10.2 Ancillary Statistics 

Mixture models can be described in the following general terms. Let $\{\mathcal{E}_z, z \in \mathcal{Z}\}$
denote a collection of experiments of which one is selected according to a known
probability distribution over $\mathcal{Z}$. For any given $z$, the experiment $\mathcal{E}_z$ consists in
observing a random quantity $X$, which has a distribution $P_\theta(\cdot \mid z)$. Although this
structure seems rather special, it is common to many statistical models.

Consider a general statistical model in which the observations $X$ are distributed
according to $P_\theta$, $\theta \in \Omega$, and suppose there exists an ancillary statistic, that is, a
statistic $Z$ whose distribution $F$ does not depend on $\theta$. Then one can think of $X$
as being obtained by a two-stage experiment: Observe first a random quantity $Z$
with distribution $F$; given $Z = z$, observe a quantity $X$ with distribution $P_\theta(\cdot \mid z)$.
The resulting $X$ is distributed according to the original distribution $P_\theta$. Under
these circumstances, the argument of the preceding section suggests that it will
frequently be appropriate to take the conditional point of view.¹ (Unless $Z$ is
discrete, these definitions involve technical difficulties concerning sets of measure
zero and the existence of conditional distributions, which we shall disregard.)

An important class of models in which ancillary statistics exist is obtained by
invariance considerations. Suppose the model $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ remains invariant
under the transformations

$$X \to gX, \quad \theta \to \bar g\theta; \qquad g \in G,\ \bar g \in \bar G,$$

and that $\bar G$ is transitive over $\Omega$.²

Theorem 10.2.1 If $\mathcal{P}$ remains invariant under $G$ and if $\bar G$ is transitive over $\Omega$,
then a maximal invariant $T$ (and hence any invariant) is ancillary.


¹ A distinction between experimental mixtures and the present situation, relying on
aspects outside the model, is discussed by Basu (1964) and Kalbfleisch (1975).

² The family $\mathcal{P}$ is then a group family; see TPE2, Section 1.3.

Proof. It follows from Theorem 6.3.2 that the distribution of a maximal invariant
under $G$ is invariant under $\bar G$. Since $\bar G$ is transitive, only constants are invariant
under $\bar G$. The probability $P_\theta(T \in B)$ is therefore constant, independent of $\theta$, for
all $B$, as was to be proved. ■


As an example, suppose that $X = (X_1,\dots,X_n)$ is distributed according to a
location family with joint density $f(x_1 - \theta,\dots,x_n - \theta)$. The most powerful test
of $H : \theta = \theta_0$ against $K : \theta = \theta_1 > \theta_0$ rejects when

$$\frac{f(x_1 - \theta_1,\dots,x_n - \theta_1)}{f(x_1 - \theta_0,\dots,x_n - \theta_0)} \ge c. \tag{10.10}$$

Here the set of differences $Y_i = X_i - X_n$ $(i = 1,\dots,n-1)$ is ancillary. This
is obvious by inspection and follows from Theorem 10.2.1 in conjunction with
Example 6.2.1(i). It may therefore be more appropriate to consider the testing
problem conditionally given $Y_1 = y_1,\dots,Y_{n-1} = y_{n-1}$. To determine the
most powerful conditional test, transform to $Y_1,\dots,Y_n$, where $Y_n = X_n$. The
conditional density of $Y_n$ given $y_1,\dots,y_{n-1}$ is

$$p_\theta(y_n \mid y_1,\dots,y_{n-1}) = \frac{f(y_1 + y_n - \theta,\dots,y_{n-1} + y_n - \theta,\ y_n - \theta)}{\int f(y_1 + u,\dots,y_{n-1} + u,\ u)\,du} \tag{10.11}$$

and the most powerful conditional test rejects when

$$\frac{p_{\theta_1}(y_n \mid y_1,\dots,y_{n-1})}{p_{\theta_0}(y_n \mid y_1,\dots,y_{n-1})} > c(y_1,\dots,y_{n-1}). \tag{10.12}$$

In terms of the original variables this becomes

$$\frac{f(x_1 - \theta_1,\dots,x_n - \theta_1)}{f(x_1 - \theta_0,\dots,x_n - \theta_0)} > c(x_1 - x_n,\dots,x_{n-1} - x_n). \tag{10.13}$$

The constant $c(x_1 - x_n,\dots,x_{n-1} - x_n)$ is determined by the fact that the
conditional probability of (10.13), given the differences of the $x$'s, is equal to $\alpha$ when
$\theta = \theta_0$.

For describing the conditional test (10.12) and calculating the critical value
$c(y_1,\dots,y_{n-1})$, it is useful to note that the statistic $Y_n = X_n$ could be replaced
by any other $Y_n$ satisfying the equivariance condition³

$$Y_n(x_1 + a,\dots,x_n + a) = Y_n(x_1,\dots,x_n) + a \quad \text{for all } a. \tag{10.14}$$

This condition is satisfied for example by the mean of the $X$'s, the median, or any
of the order statistics. As will be shown in the following Lemma 10.2.1, any two
statistics $Y_n$ and $Y_n'$ satisfying (10.14) differ only by a function of the differences
$Y_i = X_i - X_n$ $(i = 1,\dots,n-1)$. Thus conditionally, given the values $y_1,\dots,y_{n-1}$,
$Y_n$ and $Y_n'$ differ only by a constant, and their conditional distributions (and the
critical values $c(y_1,\dots,y_{n-1})$) differ by the same constant. One can therefore
choose $Y_n$, subject to (10.14), to make the conditional calculations as convenient
as possible.


Lemma 10.2.1 If $Y_n$ and $Y_n'$ both satisfy (10.14), then their difference $\Delta = Y_n' - Y_n$
depends on $(x_1,\dots,x_n)$ only through the differences $(x_1 - x_n,\dots,x_{n-1} - x_n)$.


³ For a more detailed discussion of equivariance, see TPE2, Chapter 3.

Proof. Since $Y_n$ and $Y_n'$ satisfy (10.14),

$$\Delta(x_1 + a,\dots,x_n + a) = \Delta(x_1,\dots,x_n) \quad \text{for all } a.$$

Putting $a = -x_n$, one finds

$$\Delta(x_1,\dots,x_n) = \Delta(x_1 - x_n,\dots,x_{n-1} - x_n, 0),$$

which is a function of the differences. ■
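A quick numerical illustration of the lemma, using the sample mean and the sample median as the two equivariant statistics (both are mentioned in the text as satisfying (10.14)). The function names are illustrative only.

```python
import statistics


def delta(xs):
    """Difference of two statistics satisfying (10.14): mean minus median."""
    return statistics.fmean(xs) - statistics.median(xs)


def differences(xs):
    """The ancillary differences x_i - x_n for i = 1, ..., n - 1."""
    return [x - xs[-1] for x in xs[:-1]]


if __name__ == "__main__":
    xs = [0.3, 2.0, -1.5, 4.2, 0.9]
    for a in (-3.0, 0.5, 10.0):
        shifted = [x + a for x in xs]
        # delta is unchanged by the shift, and so are the differences
        print(a, delta(shifted), differences(shifted))
    # delta evaluated at (x_1 - x_n, ..., x_{n-1} - x_n, 0), as in the proof
    print(delta(differences(xs) + [0.0]))
```

The last line reproduces the step $a = -x_n$ in the proof: evaluating the difference at the shifted sample, whose last coordinate is 0, gives the same value as evaluating it at the original sample.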

The existence of ancillary statistics is not confined to models that remain 
invariant under a transitive group G. The mixture and regression examples of 
Section 10.1 provide illustrations of ancillaries without the benefit of invariance. 
Further examples are given in Problems 10.8-10.13. 

If conditioning on an ancillary statistic is considered appropriate because it 
makes the inference more relevant to the situation at hand, it is desirable to carry 
the process as far as possible and hence to condition on a maximal ancillary. An 
ancillary Z is said to be maximal if there does not exist an ancillary U such that 
Z = f(U) without Z and U being equivalent. [For a more detailed treatment, 
which takes account of the possibility of modifying statistics on sets of measure 
zero without changing their probabilistic properties, see Basu (1959).] 

Conditioning, like sufficiency and invariance, leads to a reduction of the data. In 
the conditional model, the ancillary is no longer part of the random data but has 
become a constant. As a result, conditioning often leads to a great simplification 
of the inference. Choosing a maximal ancillary for conditioning thus has the 
additional advantage of providing the greatest reduction of the data. 

Unfortunately, maximal ancillaries are not always unique, and one must then
decide which maximal ancillary to choose for conditioning. [This problem is
discussed by Cox (1971) and Becker and Gordon (1983).] If attention is restricted
to ancillary statistics that are invariant under a given group G, the maximal
ancillary of course coincides with the maximal invariant.

Another issue concerns the order in which to apply reduction by sufficiency 
and ancillarity. 


Example 10.2.1 Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be independently distributed
according to a bivariate normal distribution with $E(X_i) = E(Y_i) = 0$,
$\mathrm{Var}(X_i) = \mathrm{Var}(Y_i) = 1$, and unknown correlation coefficient $\rho$. Then
$X_1, \ldots, X_n$ are independently distributed as $N(0, 1)$ and are therefore ancillary.
The conditional density of the $Y$'s given $X_1 = x_1, \ldots, X_n = x_n$ is

$$C \exp\left( -\frac{1}{2(1 - \rho^2)} \sum (y_i - \rho x_i)^2 \right),$$

with the sufficient statistics $(\sum Y_i^2, \sum x_i Y_i)$.

Alternatively, one could begin by noticing that $(Y_1, \ldots, Y_n)$ is ancillary. The
conditional distribution of the $X$'s given $Y_1 = y_1, \ldots, Y_n = y_n$ then admits the
sufficient statistics $(\sum X_i^2, \sum X_i y_i)$. A unique maximal ancillary $V$ does not exist
in this case, since both the $X$'s and $Y$'s would have to be functions of $V$. Thus
$V$ would have to be equivalent to the full sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, which is
not ancillary.

Suppose instead that the data are first reduced to the sufficient statistics
$T = (\sum X_i^2 + \sum Y_i^2, \sum X_i Y_i)$. Based on $T$, no nonconstant ancillaries appear to exist.^4

This example and others like it suggest that it is desirable to reduce the data as 
far as possible through sufficiency, before attempting further reduction by means 
of ancillary statistics. ■ 

Note that contrary to this suggestion, in the location example at the beginning
of the section, the problem was not first reduced to the sufficient statistics
$X_{(1)} < \cdots < X_{(n)}$. The omission can be justified in hindsight by the fact that the optimal
conditional tests are the same whether or not the observations are first reduced
to the order statistics.

In the structure described at the beginning of the section, the variable $Z$ that
labels the experiment was assumed to have a known distribution. The argument
for conditioning on the observed value of $Z$ does not depend on this assumption.
It applies also when the distribution of $Z$ depends on an unknown parameter $\vartheta$,
which is independent of $\theta$ and hence by itself contains no information about $\theta$,
that is, when the distribution of $Z$ depends only on $\vartheta$, the conditional distribution
of $X$ given $Z = z$ depends only on $\theta$, and the parameter space $\Omega$ for $(\theta, \vartheta)$ is a
Cartesian product $\Omega = \Omega_1 \times \Omega_2$, with

$$(\theta, \vartheta) \in \Omega \iff \theta \in \Omega_1 \text{ and } \vartheta \in \Omega_2. \tag{10.15}$$

(The parameters $\theta$ and $\vartheta$ are then said to be variation-independent, or unrelated.)

Statistics $Z$ satisfying this more general definition are called partial ancillary or
S-ancillary. (The term ancillary without modification will be reserved here for a
statistic that has a known distribution.) Note that if $X = (T, Z)$ and $Z$ is a partial
ancillary, then $T$ is a partial sufficient statistic in the sense of Problem 3.60. For
a more detailed discussion of this and related concepts of partial ancillarity, see
for example Basu (1978) and Barndorff-Nielsen (1978).

Example 10.2.2 Let $X$ and $Y$ be independent with Poisson distributions $P(\lambda)$
and $P(\mu)$, and let the parameter of interest be $\theta = \mu/\lambda$. It was seen in Section
4.5 that the conditional distribution of $Y$ given $Z = X + Y = z$ is binomial
$b(p, z)$ with $p = \mu/(\lambda + \mu) = \theta/(\theta + 1)$ and therefore depends only on $\theta$, while
the distribution of $Z$ is Poisson with mean $\vartheta = \lambda + \mu$. Since the parameter space
$0 < \lambda, \mu < \infty$ is equivalent to the Cartesian product of $0 < \theta < \infty$, $0 < \vartheta < \infty$,
it follows that $Z$ is S-ancillary for $\theta$.

The UMP unbiased level-$\alpha$ test of $H : \mu \leq \lambda$ against $\mu > \lambda$ is UMP also among
all tests whose conditional level given $z$ is $\alpha$ for all $z$. (The class of conditional
tests coincides exactly with the class of all tests that are similar on the boundary
$\mu = \lambda$.) ■
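
The S-ancillarity claim of Example 10.2.2 rests on the fact that the conditional distribution of $Y$ given $X + Y = z$ is binomial with success probability $\theta/(\theta + 1)$, whatever the value of $\lambda + \mu$. A minimal numerical sketch of this fact follows; the particular values of $\theta$, $z$ and $\lambda + \mu$ are arbitrary illustrations, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """P(K = k) for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def binomial_pmf(k, n, p):
    """P(K = k) for a binomial variable with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def conditional_pmf_y_given_z(y, z, lam, mu):
    """P(Y = y | X + Y = z) for independent X ~ P(lam), Y ~ P(mu)."""
    joint = poisson_pmf(z - y, lam) * poisson_pmf(y, mu)
    marginal = poisson_pmf(z, lam + mu)  # X + Y is Poisson with mean lam + mu
    return joint / marginal


def max_gap_to_binomial(theta, total, z):
    """Largest gap between the conditional pmf and b(theta / (1 + theta), z)."""
    lam = total / (1.0 + theta)  # so that lam + mu = total and mu / lam = theta
    mu = theta * lam
    p = theta / (1.0 + theta)
    return max(
        abs(conditional_pmf_y_given_z(y, z, lam, mu) - binomial_pmf(y, z, p))
        for y in range(z + 1)
    )


# Same theta, two different values of the nuisance parameter lam + mu.
for total in (1.5, 9.0):
    print(total, max_gap_to_binomial(theta=2.0, total=total, z=6))
```

Both printed gaps should be at the level of floating-point rounding, reflecting that the conditional model involves $\theta$ only.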

When $Z$ is S-ancillary for $\theta$ in the presence of a nuisance parameter $\vartheta$, the
unconditional power $\beta(\theta, \vartheta)$ of a test $\varphi$ of $H : \theta = \theta_0$ may depend on $\vartheta$ as well
as on $\theta$. The conditional power $\beta(\theta \mid z) = E_\theta[\varphi(X) \mid z]$ can then be viewed as
an unbiased estimator of the (unknown) $\beta(\theta, \vartheta)$, as was discussed at the end of
Section 4.4. On the other hand, if no nuisance parameters $\vartheta$ are present and $Z$
is ancillary for $\theta$, the unconditional power $\beta(\theta) = E_\theta \varphi(X)$ and the conditional
power $\beta(\theta \mid z)$ provide two alternative evaluations of the power of $\varphi$ against $\theta$,
which refer to different sampling frameworks, and of which the latter of course
becomes available only after the data have been obtained.


4 So far, nonexistence has not been proved. It seems likely that a proof can be obtained
by the methods of Unni (1978).

Surprisingly, the S-ancillarity of $X + Y$ in Example 10.2.2 does not extend to
the corresponding binomial problem.

Example 10.2.3 Let $X$ and $Y$ have independent binomial distributions $b(p_1, m)$
and $b(p_2, n)$ respectively. Then it was seen in Section 4.5 that the conditional
distribution of $Y$ given $Z = X + Y = z$ depends only on the crossproduct ratio
$\Delta = p_2 q_1 / p_1 q_2$ ($q_i = 1 - p_i$). However, $Z$ is not S-ancillary for $\Delta$. To see this,
note that S-ancillarity of $Z$ implies the existence of a parameter $\vartheta$ unrelated to $\Delta$
and such that the distribution of $Z$ depends only on $\vartheta$. As $\Delta$ changes, the family
of distributions $\{P_\vartheta, \vartheta \in \Omega_2\}$ of $Z$ would remain unchanged. This is not the case,
since $Z$ is binomial when $\Delta = 1$ and not otherwise (Problem 10.15). Thus $Z$ is
not S-ancillary.

In this example, all unbiased tests of $H : \Delta = \Delta_0$ have a conditional level
given $z$ that is independent of $z$, but conditioning on $z$ cannot be justified by
S-ancillarity. ■
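
The first statement of Example 10.2.3, that the conditional distribution of $Y$ given $X + Y = z$ depends on $(p_1, p_2)$ only through the crossproduct ratio, can be illustrated numerically. The sketch below compares the conditional distributions for two different pairs $(p_1, p_2)$ sharing the same ratio; the numerical values and the function names are illustrative assumptions only. It does not address the second statement (the failure of S-ancillarity), which concerns the marginal distribution of $Z$.

```python
import math


def binomial_pmf(k, n, p):
    """P(K = k) for a binomial variable with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def conditional_pmf(z, m, n, p1, p2):
    """List of P(Y = y | X + Y = z), y = 0..n, for X ~ b(p1, m), Y ~ b(p2, n)."""
    weights = []
    for y in range(n + 1):
        x = z - y
        if 0 <= x <= m:
            weights.append(binomial_pmf(x, m, p1) * binomial_pmf(y, n, p2))
        else:
            weights.append(0.0)  # impossible split of the total z
    total = sum(weights)
    return [w / total for w in weights]


def p2_from_ratio(p1, ratio):
    """Solve ratio = p2 * q1 / (p1 * q2) for p2, given p1."""
    odds2 = ratio * p1 / (1.0 - p1)
    return odds2 / (1.0 + odds2)


ratio = 3.0
m, n, z = 8, 5, 6
first = conditional_pmf(z, m, n, 0.2, p2_from_ratio(0.2, ratio))
second = conditional_pmf(z, m, n, 0.6, p2_from_ratio(0.6, ratio))

# The two conditional distributions agree although (p1, p2) differ.
print(max(abs(a - b) for a, b in zip(first, second)))
```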

Closely related to this example is the situation of the multinomial 2x2 table 
discussed from the point of view of unbiasedness in Section 4.6. 

Example 10.2.4 In the notation of Section 4.6, let the four cell entries of a
2 x 2 table be $X, X', Y, Y'$ with row totals $X + X' = M$, $Y + Y' = N$, and
column totals $X + Y = T$, $X' + Y' = T'$, and with total sample size
$M + N = T + T' = s$. Here it is easy to check that $(M, N)$ is S-ancillary for
$\theta = (\theta_1, \theta_2) = (p_{AB}/p_B, \; p_{A\tilde{B}}/p_{\tilde{B}})$ with $\vartheta = p_B$. Since the cross-product ratio $\Delta$ can
be expressed as a function of $(\theta_1, \theta_2)$, it may be appropriate to condition a test of
$H : \Delta = \Delta_0$ on $(M, N)$. Exactly analogously one finds that $(T, T')$ is S-ancillary
for $\theta' = (\theta_1', \theta_2') = (p_{AB}/p_A, \; p_{\tilde{A}B}/p_{\tilde{A}})$, and since $\Delta$ is also a function of $(\theta_1', \theta_2')$, it
may be equally appropriate to condition a test of $H$ on $(T, T')$. One might hope
that the set of all four marginals $(M, N, T, T') = Z$ would be S-ancillary for $\Delta$.
However, it is seen from the preceding example that this is not the case.

Here, all unbiased tests have a constant conditional level given $z$. However,
S-ancillarity permits conditioning on only one set of margins (without giving any
guidance as to which of the two to choose), not on both. ■

Despite such difficulties, the principle of carrying out tests and confidence 
estimation conditionally on ancillaries or S-ancillaries frequently provides an 
attractive alternative to the corresponding unconditional procedures, primarily 
because it is more appropriate for the situation at hand. However, insistence on 
such conditioning leads to another difficulty, which is illustrated by the following 
example. 

Example 10.2.5 Consider $N$ populations $\Pi_i$, and suppose that an observation
$X_i$ from $\Pi_i$ has a normal distribution $N(\xi_i, 1)$. The hypothesis to be tested is
$H : \xi_1 = \cdots = \xi_N$. Unfortunately, $N$ is so large that it is not practicable to take
an observation from each of the populations; the total sample size is restricted to
be $n < N$. A sample $\Pi_{J_1}, \ldots, \Pi_{J_n}$ of $n$ of the $N$ populations is therefore selected
at random, with probability $1/\binom{N}{n}$ for each set of $n$, and an observation $X_{J_i}$ is
obtained from each of the populations in the sample.

Here the variables $J_1, \ldots, J_n$ are ancillary, and the requirement of conditioning
on ancillaries would restrict any inference to the $n$ populations from which
observations are taken. Systematic adherence to this requirement would therefore
make it impossible to test the original hypothesis $H$.^5 Of course, rejection of the
partial hypothesis $H_{j_1, \ldots, j_n} : \xi_{j_1} = \cdots = \xi_{j_n}$ would imply rejection of the original
$H$. However, acceptance of $H_{j_1, \ldots, j_n}$ would permit no inference concerning $H$.

The requirement to condition in this case runs counter to the belief that a 
sample may permit inferences concerning the whole set of populations, which 
underlies much of statistical practice. 

With an unconditional approach such an inference is provided by the test with
rejection region

$$\sum_{i=1}^{n} \left( X_{J_i} - \frac{1}{n} \sum_{k=1}^{n} X_{J_k} \right)^2 > c,$$

where $c$ is the upper $\alpha$-percentage point of $\chi^2$ with $n - 1$ degrees of freedom.
Not only does this test actually have unconditional level $\alpha$, but its conditional
level given $J_1 = j_1, \ldots, J_n = j_n$ also equals $\alpha$ for all $(j_1, \ldots, j_n)$. There is in
fact no difference in the present case between the conditional and the
unconditional test: they will accept or reject for the same sample points. However, as
has been pointed out, there is a crucial difference between the conditional and
unconditional interpretations of the results.

If $\beta_{j_1, \ldots, j_n}(\xi_{j_1}, \ldots, \xi_{j_n})$ denotes the conditional power of this test given
$J_1 = j_1, \ldots, J_n = j_n$, its unconditional power is

$$\frac{\sum \beta_{j_1, \ldots, j_n}(\xi_{j_1}, \ldots, \xi_{j_n})}{\binom{N}{n}}$$

summed over all $\binom{N}{n}$ $n$-tuples $j_1 < \cdots < j_n$. As in the case with any test, the
conditional power given an ancillary (in the present case $J_1, \ldots, J_n$) can be viewed
as an unbiased estimate of the unconditional power. ■


10.3 Optimal Conditional Tests 

Although conditional tests are often sensible and are beginning to be employed 
in practice [see for example Lawless (1972, 1973, 1978) and Kappenman (1975)], 
not much theory has been developed for the resulting conditional models. Since 
the conditional model tends to be simpler than the original unconditional one, 
the conditional point of view will frequently bring about a simplification of the 
theory. This possibility will be illustrated in the present section on some simple 
examples. 


5 For other implications of this requirement, called the weak conditionality principle,
see Birnbaum (1962) and Berger and Wolpert (1988).

Example 10.3.1 Specializing the example discussed at the beginning of Section
10.1, suppose that a random variable is distributed according to $N(\theta, \sigma_1^2)$ or
$N(\theta, \sigma_0^2)$ as $I = 1$ or $0$, and that $P(I = 1) = P(I = 0) = \frac{1}{2}$. Then the most
powerful test of $H : \theta = \theta_0$ against $\theta = \theta_1$ ($> \theta_0$) based on $(I, X)$ rejects when

$$\frac{x - \frac{1}{2}(\theta_0 + \theta_1)}{2\sigma_i^2} \geq k.$$

A UMP test against the alternatives $\theta > \theta_0$ therefore does not exist. On the other
hand, if $H$ is tested conditionally given $I = i$, a UMP conditional test exists and
rejects when $X > c_i$ where $P(X > c_i \mid I = i) = \alpha$ for $i = 0, 1$. ■
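
The dependence of the most powerful unconditional test on the alternative can be made concrete. The sketch below rests on the observation that the likelihood ratio for an observation $(i, x)$ is increasing in $(x - \frac{1}{2}(\theta_0 + \theta_1))/\sigma_i^2$, so that the cutoffs have the form $\frac{1}{2}(\theta_0 + \theta_1) + \sigma_i^2 k$ with $k$ fixed by the overall level (any constant factor is absorbed into $k$). The numerical values of $\theta_0$, $\theta_1$, $\sigma_0$, $\sigma_1$, $\alpha$ and all function names are illustrative assumptions.

```python
import math


def normal_sf(x):
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def unconditional_mp_test(theta0, theta1, sigmas, alpha):
    """Cutoffs and conditional levels of the most powerful level-alpha test.

    The test rejects when x >= mid + sigma_i**2 * k, where mid is the midpoint
    of theta0 and theta1; k is found by bisection so that the average of the
    conditional levels equals alpha (each experiment has probability 1/2).
    """
    mid = 0.5 * (theta0 + theta1)

    def levels(k):
        return [normal_sf((mid + s * s * k - theta0) / s) for s in sigmas]

    lo, hi = -50.0, 50.0
    for _ in range(200):
        k = 0.5 * (lo + hi)
        if sum(levels(k)) / len(sigmas) > alpha:
            lo = k  # overall level too large: raise the cutoffs
        else:
            hi = k
    k = 0.5 * (lo + hi)
    return [mid + s * s * k for s in sigmas], levels(k)


sigmas = (1.0, 3.0)  # (sigma_0, sigma_1), chosen for illustration
for theta1 in (1.0, 2.0):
    cutoffs, cond_levels = unconditional_mp_test(0.0, theta1, sigmas, alpha=0.05)
    print(theta1, cutoffs, cond_levels)
```

The cutoffs change with $\theta_1$, which is why no UMP unconditional test exists, and the two conditional levels differ from each other although they average to $\alpha$. The conditional test instead uses the cutoffs $\theta_0 + \sigma_i z_\alpha$, which do not involve $\theta_1$.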

The nonexistence of UMP unconditional tests found in this example is typical 
for mixtures with known probabilities of two or more families with monotone 
likelihood ratio, despite the existence of UMP conditional tests in these cases. 


Example 10.3.2 Let $X_1, \ldots, X_n$ be a sample from a normal distribution
$N(\xi, a^2 \xi^2)$, $\xi > 0$, with known coefficient of variation $a > 0$, and consider the
problem of testing $H : \xi = \xi_0$ against $K : \xi > \xi_0$. Here $T = (T_1, T_2)$ with
$T_1 = \bar{X}$, $T_2 = \sqrt{(1/n) \sum X_i^2}$ is sufficient, and $Z = T_1/T_2$ is ancillary. If we let
$V = \sqrt{n}\, T_2 / a$, the conditional density of $V$ given $Z = z$ is equal to (Problem 10.18)

$$p_\xi(v \mid z) = \frac{k}{\xi^n}\, v^{n-1} \exp\left\{ -\frac{1}{2} \left[ \frac{v}{\xi} - \frac{z\sqrt{n}}{a} \right]^2 \right\}. \tag{10.16}$$

The density has monotone likelihood ratio, so that the rejection region $V > C(z)$
constitutes a UMP conditional test.

Unconditionally, $Y = \bar{X}$ and $S^2 = \sum (X_i - \bar{X})^2$ are independent with joint
density

$$c\, \xi^{-n} (s^2)^{(n-3)/2} \exp\left( -\frac{n}{2a^2\xi^2}(y - \xi)^2 - \frac{1}{2a^2\xi^2}\, s^2 \right), \tag{10.17}$$

and a UMP test does not exist. [For further discussion of this example, see Hinkley
(1977).] ■


An important class of examples is obtained from situations in which the model 
remains invariant under a group of transformations that is transitive over the 
parameter space, that is, when the given class of distributions constitutes a group 
family. The maximal invariant V then provides a natural ancillary on which to 
condition, and an optimal conditional test may exist even when such a test does 
not exist unconditionally. Perhaps the simplest class of examples of this kind is
provided by location families under the conditions of the following lemma.


Lemma 10.3.1 Let $X_1, \ldots, X_n$ be independently distributed according to
$f(x_i - \theta)$, with $f$ strongly unimodal. Then the family of conditional densities of
$Y_n = X_n$ given $Y_i = X_i - X_n$ ($i = 1, \ldots, n - 1$) has monotone likelihood ratio.

Proof. The conditional density (10.11) is proportional to

$$f(y_n + y_1 - \theta) \cdots f(y_n + y_{n-1} - \theta)\, f(y_n - \theta). \tag{10.18}$$

By taking logarithms and using the fact that each factor is strongly unimodal,
it is seen that the product is also strongly unimodal, and the result follows from
Example 8.2.1. ■


Lemma 10.3.1 shows that for strongly unimodal $f$ there exists a UMP
conditional test of $H : \theta \leq \theta_0$ against $K : \theta > \theta_0$ which rejects when

$$X_n > c(X_1 - X_n, \ldots, X_{n-1} - X_n). \tag{10.19}$$


Conditioning has reduced the model to a location family with sample size one. 
The double-exponential and logistic distributions are both strongly unimodal 
(Section 9.2), and thus provide examples of UMP conditional tests. In neither 
case does there exist a UMP unconditional test unless n = 1. 
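
The critical value in (10.19) rarely has a closed form, but it can be approximated numerically from (10.18). The following is a rough sketch for the logistic density, using a plain Riemann sum on a grid; the grid width, the number of grid points, and the function names are arbitrary illustrative choices rather than a recommended procedure.

```python
import math


def logistic_log_density(t):
    """log of e^{-t} / (1 + e^{-t})^2, written via |t| for numerical stability."""
    a = abs(t)
    return -a - 2.0 * math.log1p(math.exp(-a))


def conditional_critical_value(xs, theta0, alpha, log_f=logistic_log_density,
                               half_width=40.0, steps=20001):
    """Approximate c(x_1 - x_n, ..., x_{n-1} - x_n) of (10.19).

    The conditional density of X_n at x is taken proportional to the product
    of f(x + y_i - theta0) over i = 1..n, with y_i = x_i - x_n and y_n = 0.
    """
    diffs = [x - xs[-1] for x in xs]
    centre = theta0 - sum(diffs) / len(diffs)  # rough centre of the density
    grid = [centre - half_width + 2.0 * half_width * j / (steps - 1)
            for j in range(steps)]
    log_w = [sum(log_f(x + y - theta0) for y in diffs) for x in grid]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]  # rescaled to avoid underflow
    total = sum(w)
    tail = 0.0
    for x, weight in zip(reversed(grid), reversed(w)):
        tail += weight
        if tail / total >= alpha:
            return x  # smallest grid point with upper tail mass >= alpha
    return grid[0]


xs = [0.3, -1.2, 2.0, 0.5]
c = conditional_critical_value(xs, theta0=0.0, alpha=0.05)
print("reject H" if xs[-1] > c else "accept H", c)
```

For $n = 1$ the conditional density is the logistic density itself, and the computed value can be compared with the exact logistic quantile $\theta_0 + \log\{(1 - \alpha)/\alpha\}$.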

As a last class of examples, we shall consider a situation with a nuisance
parameter. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent samples from location
families with densities $f(x_1 - \xi, \ldots, x_m - \xi)$ and $g(y_1 - \eta, \ldots, y_n - \eta)$ respectively,
and consider the problem of testing $H : \eta \leq \xi$ against $K : \eta > \xi$. Here the
differences $U_i = X_i - X_m$ and $V_j = Y_j - Y_n$ are ancillary. The conditional density
of $X = X_m$ and $Y = Y_n$ given the $u$'s and $v$'s is seen from (10.18) to be of the
form

$$f_u^*(x - \xi)\, g_v^*(y - \eta), \tag{10.20}$$

where the subscripts $u$ and $v$ indicate that $f^*$ and $g^*$ depend on the $u$'s and
$v$'s respectively. The problem of testing $H$ in the conditional model remains
invariant under the transformations: $x' = x + c$, $y' = y + c$, for which $Y - X$
is maximal invariant. A UMP invariant conditional test will then exist provided
the distribution of $Z = Y - X$, which depends only on $\Delta = \eta - \xi$, has monotone
likelihood ratio. The following lemma shows that a sufficient condition for this
to be the case is that $f^*$ and $g^*$ have monotone likelihood ratio in $x$ and $y$
respectively.


Lemma 10.3.2 Let $X$, $Y$ be independently distributed with densities $f^*(x - \xi)$,
$g^*(y - \eta)$ respectively. If $f^*$ and $g^*$ have monotone likelihood ratio with respect to $\xi$
and $\eta$, then the family of densities of $Z = Y - X$ has monotone likelihood ratio
with respect to $\Delta = \eta - \xi$.

Proof. The density of $Z$ is

$$h_\Delta(z) = \int g^*(y - \Delta)\, f^*(y - z)\, dy. \tag{10.21}$$

To see that $h_\Delta(z)$ has monotone likelihood ratio, one must show that for any
$\Delta < \Delta'$, $h_{\Delta'}(z)/h_\Delta(z)$ is an increasing function of $z$. For this purpose, write

$$\frac{h_{\Delta'}(z)}{h_\Delta(z)} = \int \frac{g^*(y - \Delta')}{g^*(y - \Delta)} \cdot \frac{g^*(y - \Delta)\, f^*(y - z)}{\int g^*(u - \Delta)\, f^*(u - z)\, du}\, dy.$$

The second factor is a probability density for $Y$,

$$p_z(y) = C_z\, g^*(y - \Delta)\, f^*(y - z), \tag{10.22}$$

which has monotone likelihood ratio in the parameter $z$ by the assumption made
about $f^*$. The ratio

$$\frac{h_{\Delta'}(z)}{h_\Delta(z)} = \int \frac{g^*(y - \Delta')}{g^*(y - \Delta)}\, p_z(y)\, dy \tag{10.23}$$

is the expectation of $g^*(Y - \Delta')/g^*(Y - \Delta)$ under the distribution $p_z(y)$. By the
assumption about $g^*$, $g^*(y - \Delta')/g^*(y - \Delta)$ is an increasing function of $y$, and it
follows from Lemma 3.4.2 that its expectation is an increasing function of $z$. ■


It follows from (10.18) that $f_u^*(x - \xi)$ and $g_v^*(y - \eta)$ have monotone likelihood
ratio provided this condition holds for $f(x - \xi)$ and $g(y - \eta)$, i.e. provided $f$ and $g$
are strongly unimodal. Under this assumption, the conditional distribution $h_\Delta(z)$
then has monotone likelihood ratio by Lemma 10.3.2, and a UMP conditional test
exists and rejects for large values of $Z$. (This result also follows from Problem
8.9.)

The difference between conditional tests of the kind considered in this section 
and the corresponding (e.g., locally most powerful) unconditional tests typically 
disappears as the sample size(s) tend(s) to infinity. Some results in this direction 
are given by Liang (1984); see also Barndorff-Nielsen (1983).

The following multivariate example provides one more illustration of a UMP 
conditional test when unconditionally no UMP test exists. The results will only 
be sketched. The details of this and related problems can be found in the original 
literature reviewed by Marden and Perlman (1980) and Marden (1983). 


Example 10.3.3 Suppose you observe $m + 1$ independent normal vectors of
dimension $p = p_1 + p_2$,

$$Y = (Y_1 \;\; Y_2) \quad \text{and} \quad Z_1, \ldots, Z_m,$$

with common covariance matrix $\Sigma$ and expectations

$$E(Y_1) = \eta_1, \quad E(Y_2) = E(Z_1) = \cdots = E(Z_m) = 0.$$

(The normal multivariate two-sample problem with covariates can be reduced to
this canonical form.) The hypothesis being tested is $H : \eta_1 = 0$. Without the
restriction $E(Y_2) = 0$, the model would remain invariant under the group $G$ of
transformations: $Y^* = YB$, $Z^* = ZB$, where $B$ is any nonsingular $p \times p$ matrix.
However, the stated problem remains invariant only under the subgroup $G'$ in
which $B$ is of the form [Problem 10.22(i)]

$$B = \begin{pmatrix} B_{11} & 0 \\ B_{21} & B_{22} \end{pmatrix},$$

where the row blocks and the column blocks have sizes $p_1$ and $p_2$. If

$$Z'Z = S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

the maximal invariants under $G'$ are the two statistics $D = Y_2 S_{22}^{-1} Y_2'$ and

$$N = \frac{(Y_1 - S_{12} S_{22}^{-1} Y_2)(S_{11} - S_{12} S_{22}^{-1} S_{21})^{-1}(Y_1 - S_{12} S_{22}^{-1} Y_2)'}{1 + D},$$

and the joint distribution of $(N, D)$ depends only on the maximal invariant under
$G'$,

$$\Delta = \eta_1 (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \eta_1'.$$


The statistic $D$ is ancillary [Problem 10.22(ii)], and the conditional
distribution of $N$ given $D = d$ is that of the ratio of two independent $\chi^2$-variables: the
numerator noncentral $\chi^2$ with $p$ degrees of freedom and noncentrality parameter
$\Delta/(1 + d)$, and the denominator central $\chi^2$ with $m + 1 - p$ degrees of freedom.
It follows from Section 7.1 that the conditional density has monotone likelihood
ratio. A conditionally UMP invariant test therefore exists, and rejects $H$ when
$(m + 1 - p)N/p > C$, where $C$ is the critical value of the $F$-distribution with $p$
and $m + 1 - p$ degrees of freedom. On the other hand, a UMP invariant
(unconditional) test does not exist; comparisons of the optimal conditional test with
various competitors are provided by Marden and Perlman (1980). ■


10.4 Relevant Subsets 

The conditioning variables considered so far have been ancillary statistics, i.e.
random variables whose distribution is fixed, independent of the parameters
governing the distribution of $X$, or at least of the parameter of interest. We shall
now examine briefly some implications of conditioning without this constraint.
Throughout most of the section we shall be concerned with the simple case in
which the conditioning variable is the indicator of some subset $C$ of the sample
space, so that there are only two conditioning events $I = 1$ (i.e. $X \in C$) and $I = 0$
(i.e. $X \in C^c$, the complement of $C$). The mixture problem at the beginning of
Section 10.1, with $\mathcal{X}_1 = C$ and $\mathcal{X}_0 = C^c$, is of this type.

Suppose $X$ is distributed with density $p_\theta$, and $R$ is a level-$\alpha$ rejection region for
testing the simple hypothesis $H : \theta = \theta_0$ against some class of alternatives. For
any subset $C$ of the sample space, consider the conditional rejection probabilities

$$\alpha_C = P_{\theta_0}(X \in R \mid C) \quad \text{and} \quad \alpha_{C^c} = P_{\theta_0}(X \in R \mid C^c), \tag{10.24}$$

and suppose that $\alpha_C > \alpha$ and $\alpha_{C^c} < \alpha$. Then we are in the difficulty described
in Section 10.1. Before $X$ was observed, the probability of falsely rejecting $H$ was
stated to be $\alpha$. Now that $X$ is known to have fallen into $C$ (or $C^c$), should the
original statement be adjusted and the higher value $\alpha_C$ (or lower value $\alpha_{C^c}$) be
quoted? An extreme case of this possibility occurs when $C$ is a subset of $R$ or
$R^c$, since then $P(X \in R \mid X \in C) = 1$ or $0$.
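
A small numerical illustration of (10.24) may help. The setting below is an assumption made only for the example: $X$ is $N(0, 1)$ under $H$, the rejection region is the one-sided region $R = \{x > 1.645\}$ with $\alpha = 0.05$, and the conditioning set is $C = \{|x| > 1\}$.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


cutoff = 1.6448536269514722          # upper 5% point of N(0, 1); R = {x > cutoff}
alpha = 1.0 - normal_cdf(cutoff)

# C = {|x| > 1}. Since cutoff > 1, R is contained in C.
p_C = 2.0 * (1.0 - normal_cdf(1.0))
alpha_C = alpha / p_C                # P(X in R | X in C)
alpha_Cc = 0.0                       # R does not meet the complement |x| <= 1

print(alpha, alpha_C, alpha_Cc)
```

Here $\alpha_C$ is roughly three times the nominal level, while $\alpha_{C^c} = 0$ because $C^c$ is a subset of $R^c$, which is the extreme case mentioned in the text. The two conditional levels average back to $\alpha$ when weighted by $P(C)$ and $P(C^c)$.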

It is clearly always possible to choose $C$ so that the conditional level $\alpha_C$ exceeds
the stated $\alpha$. It is not so clear whether the corresponding possibility always exists
for the levels of a family of confidence sets for $\theta$, since the inequality must now
hold for all $\theta$.


Definition 10.4.1 A subset $C$ of the sample space is said to be a negatively
biased relevant subset for a family of confidence sets $S(X)$ with unconditional
confidence level $\gamma = 1 - \alpha$ if for some $\epsilon > 0$

$$\gamma_C(\theta) = P_\theta[\theta \in S(X) \mid X \in C] \leq \gamma - \epsilon \quad \text{for all } \theta, \tag{10.25}$$

and a positively biased relevant subset if

$$P_\theta[\theta \in S(X) \mid X \in C] \geq \gamma + \epsilon \quad \text{for all } \theta. \tag{10.26}$$

The set $C$ is semirelevant, negatively or positively biased, if respectively

$$P_\theta[\theta \in S(X) \mid X \in C] \leq \gamma \quad \text{for all } \theta \tag{10.27}$$

or

$$P_\theta[\theta \in S(X) \mid X \in C] \geq \gamma \quad \text{for all } \theta, \tag{10.28}$$

with strict inequality holding for at least some $\theta$.


Obvious examples of relevant subsets are provided by the subsets $\mathcal{X}_0$ and $\mathcal{X}_1$
of the two-experiment example of Section 10.1.

Relevant subsets do not always exist. The following four examples illustrate 
the various possibilities. 


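Example 10.4.1 Let $X$ be distributed as $N(\theta, 1)$, and consider the standard
confidence intervals for $\theta$:

$$S(X) = \{\theta : X - c < \theta < X + c\},$$

where $\Phi(c) - \Phi(-c) = \gamma$. In this case, there exists not even a semirelevant subset.

To see this, suppose first that a positively biased semirelevant subset $C$ exists,
so that

$$A(\theta) = P_\theta[X - c < \theta < X + c \text{ and } X \in C] - \gamma P_\theta[X \in C] \geq 0$$

for all $\theta$, with strict inequality for some $\theta_0$. Consider a prior normal density $\lambda(\theta)$
for $\theta$ with mean 0 and variance $\tau^2$, and let

$$\beta(x) = P[x - c < \Theta < x + c \mid x],$$

where $\Theta$ has density $\lambda(\theta)$. The posterior distribution of $\Theta$ given $x$ is then normal
with mean $\tau^2 x/(1 + \tau^2)$ and variance $\tau^2/(1 + \tau^2)$ [Problem 10.24(i)], and it follows
that

$$\beta(x) = \Phi\left[ \frac{x}{\tau\sqrt{1 + \tau^2}} + \frac{c\sqrt{1 + \tau^2}}{\tau} \right] - \Phi\left[ \frac{x}{\tau\sqrt{1 + \tau^2}} - \frac{c\sqrt{1 + \tau^2}}{\tau} \right]$$

$$\leq \Phi\left[ \frac{c\sqrt{1 + \tau^2}}{\tau} \right] - \Phi\left[ \frac{-c\sqrt{1 + \tau^2}}{\tau} \right] \leq \gamma + \frac{c}{\sqrt{2\pi}\, \tau^2}.$$

Next let $h(\theta) = \sqrt{2\pi}\, \tau \lambda(\theta) = e^{-\theta^2/2\tau^2}$ and

$$D = \int h(\theta) A(\theta)\, d\theta \leq \sqrt{2\pi}\, \tau \int \lambda(\theta) \{ P_\theta[X - c < \theta < X + c \text{ and } X \in C] - E_\theta[\beta(X) I_C(X)] \}\, d\theta + \frac{c}{\tau}.$$

The integral on the right side is the difference of two integrals each of which
equals $P[X - c < \Theta < X + c \text{ and } X \in C]$, and is therefore 0, so that $D \leq c/\tau$.

Consider now a sequence of normal priors $\lambda_m(\theta)$ with variances $\tau_m^2 \to \infty$, and
the corresponding sequences $h_m(\theta)$ and $D_m$. Then $0 \leq D_m \leq c/\tau_m$ and hence
$D_m \to 0$. On the other hand, $D_m$ is of the form $D_m = \int_{-\infty}^{\infty} A(\theta) h_m(\theta)\, d\theta$, where
$A(\theta)$ is continuous, nonnegative, and $> 0$ for some $\theta_0$. There exists $\delta > 0$ such
that $A(\theta) \geq \frac{1}{2} A(\theta_0)$ for $|\theta - \theta_0| < \delta$ and hence

$$D_m \geq \int_{\theta_0 - \delta}^{\theta_0 + \delta} \frac{1}{2} A(\theta_0)\, h_m(\theta)\, d\theta \to \delta A(\theta_0) > 0 \quad \text{as } m \to \infty.$$

This provides the desired contradiction. ■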
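
The key inequality of the argument, that the posterior coverage $\beta(x)$ exceeds $\gamma$ by at most $c/(\sqrt{2\pi}\,\tau^2)$, can be checked numerically. The sketch below evaluates $\beta(x)$ from the posterior $N(\tau^2 x/(1 + \tau^2), \tau^2/(1 + \tau^2))$ on a grid of $x$-values; the choice $c = 1$, the grid, and the function names are illustrative assumptions.

```python
import math


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def posterior_coverage(x, c, tau):
    """beta(x) = P[x - c < Theta < x + c | x] under the N(0, tau^2) prior."""
    shift = x / (tau * math.sqrt(1.0 + tau**2))
    scale = c * math.sqrt(1.0 + tau**2) / tau
    return Phi(shift + scale) - Phi(shift - scale)


def worst_case_and_bound(c, tau):
    """Largest beta(x) over a grid of x-values, and the bound gamma + c/(sqrt(2 pi) tau^2)."""
    gamma = Phi(c) - Phi(-c)
    worst = max(posterior_coverage(0.1 * j, c, tau) for j in range(-200, 201))
    bound = gamma + c / (math.sqrt(2.0 * math.pi) * tau**2)
    return worst, bound


for tau in (0.5, 1.0, 5.0, 50.0):
    print(tau, worst_case_and_bound(1.0, tau))
```

As $\tau$ grows the two numbers approach $\gamma$ from above, which is what drives $D_m \to 0$ in the proof.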


That also no negatively semirelevant subsets exist is a consequence of the 
following result. 


Theorem 10.4.2 Let $S(x)$ be a family of confidence sets for $\theta$ such that
$P_\theta[\theta \in S(X)] = \gamma$ for all $\theta$, and suppose that $0 < P_\theta(C) < 1$ for all $\theta$.

(i) If $C$ is semirelevant, then its complement $C^c$ is semirelevant with opposite
bias.

(ii) If there exists a constant $a$ such that

$$1 > P_\theta(C) > a > 0 \quad \text{for all } \theta$$

and $C$ is relevant, then $C^c$ is relevant with opposite bias.

Proof. The result is an immediate consequence of the identity

$$P_\theta(C)[\gamma_C(\theta) - \gamma] = [1 - P_\theta(C)][\gamma - \gamma_{C^c}(\theta)]. \; ■$$

The next example illustrates the situation in which a semirelevant subset exists 
but no relevant one. 


Example 10.4.2 Let $X$ be $N(\theta, 1)$, and consider the uniformly most accurate
lower confidence bounds $\underline{\theta} = X - c$ for $\theta$, where $\Phi(c) = \gamma$. Here $S(X)$ is the
interval $[X - c, \infty)$ and it seems plausible that the conditional probability of
$\theta \in S(X)$ will be lowered for a set $C$ of the form $X \geq k$. In fact

$$P_\theta(X - c \leq \theta \mid X \geq k) = \begin{cases} \dfrac{\Phi(c) - \Phi(k - \theta)}{1 - \Phi(k - \theta)} & \text{when } \theta > k - c, \\[2ex] 0 & \text{when } \theta < k - c. \end{cases} \tag{10.29}$$

The probability (10.29) is always $< \gamma$, and tends to $\gamma$ as $\theta \to \infty$. The set $X \geq k$
is therefore semirelevant negatively biased for the confidence sets $S(X)$.
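
The behavior of (10.29) is easy to tabulate. The sketch below evaluates the conditional coverage for $\gamma = 0.9$ and an arbitrary cutoff $k$; these numerical choices and the function names are assumptions made for illustration.

```python
import math


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def conditional_coverage(theta, k, c):
    """P_theta(X - c <= theta | X >= k) for X ~ N(theta, 1), as in (10.29)."""
    if theta <= k - c:
        return 0.0  # on X >= k the bound X - c exceeds theta
    p_below_k = Phi(k - theta)
    return (Phi(c) - p_below_k) / (1.0 - p_below_k)


c = 1.2815515655446004   # Phi(c) = 0.9, so gamma = 0.9
gamma = Phi(c)
k = 0.0
for theta in (-2.0, -1.0, 0.0, 1.0, 3.0, 6.0):
    print(theta, conditional_coverage(theta, k, c))
```

The printed values stay below $\gamma$ for every $\theta$ and approach $\gamma$ as $\theta$ increases, so no fixed $\epsilon > 0$ can separate them from $\gamma$. This is the difference between a semirelevant and a relevant subset.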

We shall now show that no relevant subset $C$ with $P_\theta(C) > 0$ exists in this
case. It is enough to prove the result for negatively biased sets; the proof for
positive bias is exactly analogous. Let $A$ be the set of $x$-values $-\infty < x < c + \theta$,
and suppose that $C$ is negatively biased and relevant, so that

$$P_\theta[X \in A \mid C] \leq \gamma - \epsilon \quad \text{for all } \theta.$$

If

$$a(\theta) = P_\theta(X \in C), \quad b(\theta) = P_\theta(X \in A \cap C),$$

then

$$b(\theta) \leq (\gamma - \epsilon)\, a(\theta) \quad \text{for all } \theta. \tag{10.30}$$

The result is proved by comparing the integrated coverage probabilities

$$A(R) = \int_{-R}^{R} a(\theta)\, d\theta, \quad B(R) = \int_{-R}^{R} b(\theta)\, d\theta$$

with the Lebesgue measure of the intersection $C \cap (-R, R)$,

$$\mu(R) = \int_{-R}^{R} I_C(x)\, dx,$$

where $I_C(x)$ is the indicator of $C$, and showing that

$$\frac{A(R)}{\mu(R)} \to 1, \quad \frac{B(R)}{\mu(R)} \to \gamma \quad \text{as } R \to \infty. \tag{10.31}$$

This contradicts the fact that by (10.30),

$$B(R) \leq (\gamma - \epsilon) A(R) \quad \text{for all } R,$$

and so proves the desired result.

To prove (10.31), suppose first that $\mu(\infty) < \infty$. Then if $\phi$ is the standard
normal density

$$A(\infty) = \int_{-\infty}^{\infty} d\theta \int_C \phi(x - \theta)\, dx = \int_C dx = \mu(\infty),$$

and analogously $B(\infty) = \gamma \mu(\infty)$, which establishes (10.31).
When $\mu(\infty) = \infty$, (10.31) will be proved by showing that

$$A(R) = \mu(R) + K_1(R), \quad B(R) = \gamma \mu(R) + K_2(R), \tag{10.32}$$

where $K_1(R)$ and $K_2(R)$ are bounded. To see (10.32), note that

$$\mu(R) = \int_{-R}^{R} I_C(x)\, dx = \int_{-R}^{R} I_C(x) \left[ \int_{-\infty}^{\infty} \phi(x - \theta)\, d\theta \right] dx = \int_{-\infty}^{\infty} \left[ \int_{-R}^{R} I_C(x) \phi(x - \theta)\, dx \right] d\theta,$$

while

$$A(R) = \int_{-R}^{R} \left[ \int_{-\infty}^{\infty} I_C(x) \phi(x - \theta)\, dx \right] d\theta. \tag{10.33}$$

A comparison of each of these double integrals with that over the region
$-R < x < R$, $-R < \theta < R$, shows that the difference $A(R) - \mu(R)$ is made up of
four integrals, each of which can be seen to be bounded by using the fact that
$\int |t| \phi(t)\, dt < \infty$ [Problem 10.24(ii)]. This completes the proof. ■

Example 10.4.3 Let $X_1, \ldots, X_n$ be independently normally distributed as
$N(\xi, \sigma^2)$, and consider the uniformly most accurate equivariant (and unbiased)
confidence intervals for $\xi$ given by (5.36).

It was shown by Buehler and Feddersen (1963) and Brown (1967) that in this
case there exist positively biased relevant subsets of the form

$$C : \frac{|\bar{X}|}{S} \leq k. \tag{10.34}$$

In particular, for confidence level $\gamma = .5$ and $n = 2$, Brown shows that with
$C : |\bar{X}|/|X_2 - X_1| \leq \frac{1}{2}(1 + \sqrt{2})$, the conditional level is $> \frac{2}{3}$ for all values of $\xi$
and $\sigma$. Goutis and Casella (1992) provide detailed values for general $n$.

It follows from Theorem 10.4.2 that $C^c$ is negatively biased semirelevant, and
Buehler (1959) shows that any set $C^* : S \leq k$ has the same property. These
results are intuitively plausible, since the length of the confidence intervals is
proportional to $S$, and one would expect short intervals to cover the true value
less often than long ones.

Theorem 10.4.2 does not show that $C^c$ is negatively biased relevant, since the
probability of the set (10.34) tends to zero as $\xi/\sigma \to \infty$. It was in fact proved by
Robinson (1976) that no negatively biased relevant subset exists in this case.

The calculations for $C^c$ throw some light on the common practice of stating
confidence intervals for $\xi$ only when a preliminary test of $H : \xi = 0$ rejects the
hypothesis. For a discussion of this practice see Olshen (1973), and Meeks and
D'Agostino (1983). ■

The only type of example still missing is that of a negatively biased relevant
subset. It was pointed out by Fisher (1956a,b) that the Welch-Aspin solution
of the Behrens-Fisher problem (discussed in Sections 6.6 and 11.3) provides an
illustration of this possibility. The following are much simpler examples of both
negatively and positively biased relevant subsets.

Example 10.4.4 An extreme form of both positively and negatively biased
subsets was encountered in Section 7.7, where lower and upper confidence bounds
$\underline{\Delta} \leq \Delta$ and $\Delta \leq \bar{\Delta}$ were obtained in (7.42) and (7.43) for the ratio $\Delta = \sigma_A^2/\sigma^2$
in a model II one-way classification. Since

$$P(\underline{\Delta} \leq \Delta \mid \underline{\Delta} < 0) = 1 \quad \text{and} \quad P(\Delta \leq \bar{\Delta} \mid \bar{\Delta} < 0) = 0,$$

the sets $C_1 : \underline{\Delta} < 0$ and $C_2 : \bar{\Delta} < 0$ are relevant subsets with positive and
negative bias respectively. ■

The existence of conditioning sets $C$ for which the conditional coverage
probability of level-$\gamma$ confidence sets is 0 or 1, such as in Example 10.4.4 or Problems
10.27, 10.28, is an embarrassment to confidence theory, but fortunately such sets are
rare. The significance of more general relevant subsets is less clear,^6 particularly
when a number of such subsets are available. Especially awkward in this
connection is the possibility [discussed by Buehler (1959)] of the existence of two
relevant subsets $C$ and $C'$ with nonempty intersection and opposite bias.


6 For a discussion of this issue, see Buehler (1959), Robinson (1976, 1979a), and
Bondar (1977).

If a conditional confidence level is to be cited for some relevant subset $C$, it
seems appropriate to take account also of the possibility that $X$ may fall into
$C^c$ and to state in advance the three confidence coefficients $\gamma$, $\gamma_C$, and $\gamma_{C^c}$.
The (unknown) probabilities $P_\theta(C)$ and $P_\theta(C^c)$ should also be considered. These
points have been stressed by Kiefer, who has also suggested the extension to a
partition of the sample space into more than two sets. For an account of these
ideas see Kiefer (1977a,b), Brownie and Kiefer (1977), and Brown (1978).

Kiefer’s theory does not consider the choice of conditioning set or statistic. The 
same question arose in Section 10.2 with respect to conditioning on ancillaries. 
The problem is similar to that of the choice of model. The answer depends on 
the context and purpose of the analysis, and must be determined from case to 
case. 


10.5 Problems 

Section 10.1 

Problem 10.1 Let the experiments of $\mathcal{E}$ and $\mathcal{F}$ consist in observing $X : N(\xi, \sigma_0^2)$
and $X : N(\xi, \sigma_1^2)$ respectively ($\sigma_0 < \sigma_1$), and let one of the two experiments be
performed, with $P(\mathcal{E}) = P(\mathcal{F}) = \frac{1}{2}$. For testing $H : \xi = 0$ against $\xi = \xi_1$,
determine values $\sigma_0$, $\sigma_1$, $\xi_1$, and $\alpha$ such that

(i) $\alpha_0 < \alpha_1$; (ii) $\alpha_0 > \alpha_1$,

where the $\alpha_i$ are defined by (10.9).


Problem 10.2 Under the assumptions of Problem 10.1, determine the most
accurate invariant (under the transformation $X' = -X$) confidence sets $S(X)$
with

$$P(\xi \in S(X) \mid \mathcal{E}) + P(\xi \in S(X) \mid \mathcal{F}) = 2\gamma.$$

Find examples in which the conditional confidence coefficients $\gamma_0$ given $\mathcal{E}$ and $\gamma_1$
given $\mathcal{F}$ satisfy

(i) $\gamma_0 < \gamma_1$; (ii) $\gamma_0 > \gamma_1$.

Problem 10.3 The test given by (10.3), (10.8), and (10.9) is most powerful 
under the stated assumptions. 

Problem 10.4 Let Ai,... ,X n be independently distributed, each with proba¬ 
bility p or q as N(^,ctq) or N(£,al). 

(i) If p is unknown, determine the UMP unbiased test of H : £ = 0 against 

K : £ > 0 . 

(ii) Determine the most powerful test of H against the alternative £1 when it 
is known that p = |, and show that a UMP unbiased test does not exist 
in this case. 
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(iii) Let ak (k = 0,..., n) be the conditional level of the unconditional most 
powerful test of part (ii) given that k of the X’s came from 1V(£, uq) an d 
n — k from N(£, af). Investigate the possible values ao, Qi,..., a n - 

Problem 10.5 With known probabilities $p$ and $q$ perform either $\mathcal{E}$
or $\mathcal{F}$, with $X$ distributed as $N(\theta, 1)$ under $\mathcal{E}$ or
$N(-\theta, 1)$ under $\mathcal{F}$. For testing $H : \theta = 0$ against
$\theta > 0$ there exist a UMP unconditional and a UMP conditional
level-$\alpha$ test. These coincide and do not depend on the value of $p$.

Problem 10.6 In the preceding problem, suppose that the densities of $X$ under
$\mathcal{E}$ and $\mathcal{F}$ are $\theta e^{-\theta x}$ and
$(1/\theta) e^{-x/\theta}$ respectively. Compare the UMP conditional and
unconditional tests of $H : \theta = 1$ against $K : \theta > 1$.


Section 10.2 

Problem 10.7 Let $X$, $Y$ be independently normally distributed as
$N(\theta, 1)$, and let $V = Y - X$ and

$$W = \begin{cases} Y - X & \text{if } X + Y > 0, \\ X - Y & \text{if } X + Y \le 0. \end{cases}$$

(i) Both $V$ and $W$ are ancillary, but neither is a function of the other.

(ii) $(V, W)$ is not ancillary. [Basu (1959).]

Problem 10.8 An experiment with $n$ observations $X_1, \ldots, X_n$ is planned,
with each $X_i$ distributed as $N(\theta, 1)$. However, some of the
observations do not materialize (for example, some of the subjects die, move
away, or turn out to be unsuitable). Let $I_j = 1$ or $0$ as $X_j$ is observed
or not, and suppose the $I_j$ are independent of the $X$’s and of each other
and that $P(I_j = 1) = p$ for all $j$.

(i) If $p$ is known, the effective sample size $M = \sum I_j$ is ancillary.

(ii) If $p$ is unknown, there exists a UMP unbiased level-$\alpha$ test of
$H : \theta \le 0$ vs. $K : \theta > 0$. Its conditional level (given $M = m$)
is $\alpha_m = \alpha$ for all $m = 0, \ldots, n$.

Problem 10.9 Consider $n$ tosses with a biased die, for which the probabilities
of $1, \ldots, 6$ points are given by

    Points        Probability
    1             $(1 - \theta)/12$
    2             $(2 - \theta)/12$
    3             $(3 - \theta)/12$
    4             $(1 + \theta)/12$
    5             $(2 + \theta)/12$
    6             $(3 + \theta)/12$

and let $X_i$ be the number of tosses showing $i$ points.

(i) Show that the triple $Z_1 = X_1 + X_5$, $Z_2 = X_2 + X_4$,
$Z_3 = X_3 + X_6$ is a maximal ancillary; determine its distribution and the
distribution of $X_1, \ldots, X_6$ given $Z_1 = z_1$, $Z_2 = z_2$, $Z_3 = z_3$.

(ii) Exhibit five other maximal ancillaries. [Basu (1964).]

Problem 10.10 In the preceding problem, suppose the probabilities are given
by

    Points        Probability
    1             $(1 - \theta)/6$
    2             $(1 - 2\theta)/6$
    3             $(1 - 3\theta)/6$
    4             $(1 + \theta)/6$
    5             $(1 + 2\theta)/6$
    6             $(1 + 3\theta)/6$

Exhibit two different maximal ancillaries.


Problem 10.11 Let $X$ be uniformly distributed on $(\theta, \theta + 1)$,
$0 < \theta < \infty$, let $[X]$ denote the largest integer $\le X$, and let
$V = X - [X]$.

(i) The statistic $V(X)$ is uniformly distributed on $(0, 1)$ and is therefore
ancillary.

(ii) The marginal distribution of $[X]$ is given by

$$[X] = \begin{cases} [\theta] & \text{with probability } 1 - V(\theta), \\ [\theta] + 1 & \text{with probability } V(\theta). \end{cases}$$

(iii) Conditionally, given that $V = v$, $[X]$ assigns probability 1 to the
value $[\theta]$ if $V(\theta) \le v$ and to the value $[\theta] + 1$ if
$V(\theta) > v$. [Basu (1964).]


Problem 10.12 Let $X$, $Y$ have joint density

$$p(x, y) = 2 f(x) f(y) F(\theta x y),$$

where $f$ is a known probability density symmetric about 0, and $F$ its
cumulative distribution function. Then

(i) $p(x, y)$ is a probability density.

(ii) $X$ and $Y$ each have marginal density $f$ and are therefore ancillary,
but $(X, Y)$ is not.

(iii) $X \cdot Y$ is a sufficient statistic for $\theta$. [Dawid (1977).]


Problem 10.13 A sample of size $n$ is drawn with replacement from a population
consisting of $N$ distinct unknown values $\{a_1, \ldots, a_N\}$. The number of
distinct values in the sample is ancillary.


Problem 10.14 Assuming the distribution (4.22) of Section 4.9, show that $Z$ is
S-ancillary for $p = p_+/(p_+ + p_-)$.


Problem 10.15 In the situation of Example 10.2.3, $X + Y$ is binomial if and
only if $\Delta = 1$.


Problem 10.16 In the situation of Example 10.2.2, the statistic $Z$ remains
S-ancillary when the parameter space is
$\Omega = \{(\lambda, \mu) : \mu \le \lambda\}$.

Problem 10.17 Suppose $X = (U, Z)$, the density of $X$ factors into

$$p_{\theta, \vartheta}(x) = c(\theta, \vartheta) \, g_\theta(u; z) \, h_\vartheta(z) \, k(u, z),$$

and the parameters $\theta$, $\vartheta$ are unrelated. To see that these
assumptions are not enough to insure that $Z$ is S-ancillary for $\theta$,
consider the joint density

$$C(\theta, \vartheta) \, e^{-\frac{1}{2}(u - \theta)^2 - \frac{1}{2}(z - \vartheta)^2} \, I(u, z),$$

where $I(u, z)$ is the indicator of the set $\{(u, z) : u \le z\}$.
[Basu (1978).]


Section 10.3 

Problem 10.18 Verify the density (10.16) of Example 10.3.2. 

Problem 10.19 Let the real-valued function $f$ be defined on an open interval.

(i) If $f$ is logconvex, it is convex.

(ii) If $f$ is strongly unimodal, it is unimodal.

Problem 10.20 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be positive,
independent random variables distributed with densities $f(x/\sigma)$ and
$g(y/\tau)$ respectively. If $f$ and $g$ have monotone likelihood ratios in
$(x, \sigma)$ and $(y, \tau)$ respectively, there exists a UMP conditional test
of $H : \tau/\sigma \le \Delta_0$ against $\tau/\sigma > \Delta_0$ given the
ancillary statistics $U_i = X_i/X_m$ and $V_j = Y_j/Y_n$
($i = 1, \ldots, m - 1$; $j = 1, \ldots, n - 1$).

Problem 10.21 Let $V_1, \ldots, V_n$ be independently distributed as $N(0, 1)$,
and given $V_1 = v_1, \ldots, V_n = v_n$, let $X_i$ ($i = 1, \ldots, n$) be
independently distributed as $N(\theta v_i, 1)$.

(i) There does not exist a UMP test of $H : \theta = 0$ against
$K : \theta > 0$.

(ii) There does exist a UMP conditional test of $H$ against $K$ given the
ancillary $(V_1, \ldots, V_n)$. [Buehler (1982).]

Problem 10.22 In Example 10.3.3, 

(i) the problem remains invariant under G' but not under G; 

(ii) the statistic D is ancillary. 


Section 10.4 

Problem 10.23 In Example 10.4.1, check directly that the set
$C = \{x : x \le -k \text{ or } x \ge k\}$ is not a negatively biased
semirelevant subset for the confidence intervals $(X - c, X + c)$.

Problem 10.24 (i) Verify the posterior distribution of $\Theta$ given $x$
claimed in Example 10.4.1.

(ii) Complete the proof of (10.32).

Problem 10.25 Let $X$ be a random variable with cumulative distribution
function $F$. If $E|X| < \infty$, then $\int_{-\infty}^{0} F(x) \, dx$ and
$\int_{0}^{\infty} [1 - F(x)] \, dx$ are both finite.
[Apply integration by parts to the two integrals.]

Problem 10.26 Let $X$ have probability density $f(x - \theta)$, and suppose
that $E|X| < \infty$. For the confidence intervals $X - c < \theta$ there exist
semirelevant but no relevant subsets. [Buehler (1959).]

Problem 10.27 Let $X_1, \ldots, X_n$ be independently distributed according to
the uniform distribution $U(\theta, \theta + 1)$.

(i) Uniformly most accurate lower confidence bounds $\underline{\theta}$ for
$\theta$ at confidence level $1 - \alpha$ exist and are given by

$$\underline{\theta} = \max(X_{(1)} - k, \; X_{(n)} - 1),$$

where $X_{(1)} = \min(X_1, \ldots, X_n)$, $X_{(n)} = \max(X_1, \ldots, X_n)$,
and $(1 - k)^n = \alpha$.

(ii) The set $C : x_{(n)} - x_{(1)} \ge 1 - k$ is a relevant subset with
$P_\theta(\underline{\theta} \le \theta \mid C) = 1$ for all $\theta$.

(iii) Determine the uniformly most accurate conditional lower confidence
bounds $\underline{\theta}(v)$ given the ancillary statistic
$V = X_{(n)} - X_{(1)} = v$, and compare them with $\underline{\theta}$. [The
conditional distribution of $Y = X_{(1)}$ given $V = v$ is
$U(\theta, \theta + 1 - v)$.]

[Pratt (1961a), Barnard (1976).]

Problem 10.28 (i) Under the assumptions of the preceding problem, the
uniformly most accurate unbiased (or invariant) confidence intervals for
$\theta$ at confidence level $1 - \alpha$ are

$$\underline{\theta} = \max(X_{(1)} + d, \; X_{(n)}) - 1 < \theta < \min(X_{(1)}, \; X_{(n)} - d) = \bar{\theta},$$

where $d$ is the solution of the equation

$$2 d^n = \alpha \quad \text{if } \alpha < 1/2^{n-1},$$

$$2 d^n - (2d - 1)^n = \alpha \quad \text{if } \alpha > 1/2^{n-1}.$$

(ii) The sets $C_1 : X_{(n)} - X_{(1)} > d$ and
$C_2 : X_{(n)} - X_{(1)} < 2d - 1$ are relevant subsets with coverage
probability

$$P_\theta[\underline{\theta} < \theta < \bar{\theta} \mid C_1] = 1 \quad \text{and} \quad P_\theta[\underline{\theta} < \theta < \bar{\theta} \mid C_2] = 0.$$

(iii) Determine the uniformly most accurate unbiased (or invariant) conditional
confidence intervals $\underline{\theta}(v) < \theta < \bar{\theta}(v)$ given
$V = v$ at confidence level $1 - \alpha$, and compare $\underline{\theta}(v)$,
$\bar{\theta}(v)$, and $\bar{\theta}(v) - \underline{\theta}(v)$ with the
corresponding unconditional quantities.

[Welch (1939), Pratt (1961a), Kiefer (1977a).]

Problem 10.29 Suppose $X_1$ and $X_2$ are i.i.d. with

$$P\{X_i = \theta - 1\} = P\{X_i = \theta + 1\} = 1/2 .$$

Let $C$ be the confidence set consisting of the single point $(X_1 + X_2)/2$ if
$X_1 \ne X_2$ and $X_1 - 1$ if $X_1 = X_2$. Show that, for all $\theta$,

$$P_\theta\{\theta \in C\} = .75 ,$$

but

$$P_\theta\{\theta \in C \mid X_1 = X_2\} = .5$$

and

$$P_\theta\{\theta \in C \mid X_1 \ne X_2\} = 1 .$$

[Berger and Wolpert (1988)] 
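
The three probabilities in Problem 10.29 can be checked by listing the four
equally likely outcomes of $(X_1, X_2)$. The following Python sketch is an
illustration only, not part of the original problem set; the function name
`coverage` and the restriction to an integer `theta` (so that exact rational
arithmetic can be used) are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product


def coverage(theta=0):
    """Return (overall, given X1 == X2, given X1 != X2) coverage of C.

    theta is taken to be an integer here (an assumption of this sketch)
    so that all arithmetic stays exact.
    """
    # Each X_i equals theta - 1 or theta + 1 with probability 1/2.
    outcomes = list(product((theta - 1, theta + 1), repeat=2))
    weight = Fraction(1, len(outcomes))

    def covers(x1, x2):
        # C is the point (x1 + x2)/2 if x1 != x2, and x1 - 1 otherwise.
        point = Fraction(x1 + x2, 2) if x1 != x2 else x1 - 1
        return point == theta

    equal = [x for x in outcomes if x[0] == x[1]]
    differ = [x for x in outcomes if x[0] != x[1]]

    overall = sum(weight for x in outcomes if covers(*x))
    given_equal = Fraction(sum(covers(*x) for x in equal), len(equal))
    given_differ = Fraction(sum(covers(*x) for x in differ), len(differ))
    return overall, given_equal, given_differ


if __name__ == "__main__":
    print(coverage(0))
```

The enumeration does not replace the requested argument, but it makes the role
of the conditioning event $\{X_1 = X_2\}$ concrete: the unconditional
coefficient averages two quite different conditional coverages.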

Problem 10.30 Instead of conditioning the confidence sets $\theta \in S(X)$ on
a set $C$, consider a randomized procedure which assigns to each point $x$ a
probability $\psi(x)$ and makes the confidence statement $\theta \in S(x)$ with
probability $\psi(x)$ when $x$ is observed.$^{7}$

(i) The randomized procedure can be represented by a nonrandomized
conditioning set for the observations $(X, U)$, where $U$ is uniformly
distributed on $(0, 1)$ and independent of $X$, by letting
$C = \{(x, u) : u < \psi(x)\}$.

(ii) Extend the definition of relevant and semirelevant subsets to randomized
conditioning (without the use of $U$).

(iii) Let $\theta \in S(X)$ be equivalent to the statement $X \in A(\theta)$.
Show that $\psi$ is positively biased semirelevant if and only if the random
variables $\psi(X)$ and $I_{A(\theta)}(X)$ are positively correlated, where
$I_A$ denotes the indicator of the set $A$.

Problem 10.31 The nonexistence of (i) semirelevant subsets in Example 10.4.1 
and (ii) relevant subsets in Example 10.4.2 extends to randomized conditioning 
procedures. 


10.6 Notes 

Conditioning on ancillary statistics was introduced by Fisher (1934, 1935,
1936).$^{8}$ The idea was emphasized in Fisher (1956b) and by Cox (1958), who
motivated it in terms of mixtures of experiments providing different amounts of
information. The consequences of adopting a general principle of conditioning
in mixture situations were explored by Birnbaum (1962) and Durbin (1970).
Following Fisher’s suggestion (1934), Pitman (1938b) developed a theory of
conditional tests and confidence intervals for location and scale parameters.
For a recent paradox concerning conditioning on an ancillary statistic, see
Brown (1990) and Wang (1999).

The possibility of relevant subsets was pointed out by Fisher (1956a,b) (who
called them recognizable). Its implications (in terms of betting procedures)
were developed by Buehler (1959), who in particular introduced the distinction
between relevant and semirelevant, positively and negatively biased subsets,
and proved the nonexistence of relevant subsets in location models. The role of
relevant subsets in statistical inference, and their relationship to Bayes and
admissibility properties, was discussed by Pierce (1973), Robinson (1976,
1979a,b), Bondar (1977), and Casella (1988), among others.

7 Randomized and nonrandomized conditioning is interpreted in terms of betting
strategies by Buehler (1959) and Pierce (1973).

8 Fisher’s contributions to this topic are discussed in Savage (1976, pp.
467-469).

Fisher (1956a, b) introduced the idea of relevant subsets in the context of
the Behrens-Fisher problem. As a criticism of the Welch-Aspin solution, he
established the existence of negatively biased relevant subsets for that
procedure. It was later shown by Robinson (1976) that no such subsets exist for
Fisher’s preferred solution, the so-called Behrens-Fisher intervals. This fact
may be related to the conjecture [supported by substantial numerical evidence
in Robinson (1976) but so far unproved] that the unconditional coverage
probability of the Behrens-Fisher intervals always exceeds the nominal level.
For a review of these issues, see Wallace (1980) and Robinson (1982).

Maata and Casella (1987) examine the conditional properties of some confidence
intervals for the variance in the one-sample normal problem. For the
conditional properties of some confidence sets for the multivariate normal
mean, including confidence sets centered at James-Stein or shrinkage
estimators, see Casella (1987) and George and Casella (1994). The conditional
properties of the standard confidence sets in a normal linear model are studied
in Hwang and Brown (1991).

In testing a simple hypothesis against a simple alternative, Berger, Brown and 
Wolpert (1994) present a conditional frequentist methodology that agrees with a 
Bayesian approach. 


Basic Large Sample Theory 


11.1 Introduction 

Chapters 3-7 were concerned with the derivation of UMP, UMP unbiased, and
UMP invariant tests. Unfortunately, the existence of such tests turned out to be
restricted essentially to one-parameter families with monotone likelihood ratio,
exponential families, and group families, respectively. Tests maximizing the
minimum or average power over suitable classes of alternatives exist fairly
generally, but are difficult to determine explicitly, and their derivation in
Chapter 8 was confined primarily to situations in which invariance
considerations apply.

Despite their limitations, these approaches have proved their value by
application to large classes of important situations. On the other hand, they
are unlikely to be applicable to complex new problems. What is needed for such
cases is a simpler, less detailed, more generally applicable formulation. The
development and implementation of such an approach will be the subject of the
remaining chapters. It replaces optimality by asymptotic optimality obtained by
embedding the actual situation in a sequence of situations of increasing sample
size, and applying optimality to the limit situation. These limits tend to be of
the simple type for which optimality has been established in earlier chapters.

A feature of asymptotic optimality is that it refers not to a single test but to a 
sequence of tests, although this distinction will often be suppressed. An important 
consequence is that asymptotically optimal procedures - unlike most optimal 
procedures in the small sample approach - are not unique since many different 
sequences have the same limit. In fact, quite different methods of construction 
may lead to procedures which are asymptotically optimal. 

The following are some specific examples to keep in mind where finite sample
considerations fail to provide optimal procedures, but for which a large sample
approach will be seen to be more successful.


Example 11.1.1 (One parameter families) Suppose $X_1, \ldots, X_n$ are i.i.d.
according to some family of distributions $P_\theta$ indexed by a real-valued
parameter $\theta$. Then, it was mentioned after Corollary 3.4.1 that UMP tests
for testing $\theta = \theta_0$ against $\theta > \theta_0$ exist for all
sample sizes (under weak regularity conditions) only when the distributions
$P_\theta$ constitute an exponential family. For example, location models
typically do not have monotone likelihood ratio, and so UMP tests rarely exist
in this situation, though the normal location model is a happy exception. On
the other hand, we shall see that under weak assumptions, there generally exist
tests for one-parameter families which are asymptotically UMP in a suitable
sense; see Section 13.3. For example, we shall derive an asymptotically optimal
one-sided test in the Cauchy location model, among others. ■

Example 11.1.2 (Behrens-Fisher Problem) Consider testing the equality 
of means for two independent samples, from normal distributions with possibly 
different (unknown) variances. As previously mentioned, finite sample optimality 
considerations such as unbiasedness or invariance do not lead to an optimal test, 
even though the setting is a multiparameter exponential family. An optimal test 
sequence will be derived in Example 13.5.4. ■ 


Example 11.1.3 (The Chi-squared Test) Consider $n$ multinomial trials with
$k + 1$ possible outcomes, labelled 1 to $k + 1$. Suppose $p_j$ denotes the
probability of a result in the $j$th category. Let $Y_j$ denote the number of
trials resulting in category $j$, so that $(Y_1, \ldots, Y_{k+1})$ has the
multinomial distribution with joint density obtained in Example 2.7.2. Suppose
the null hypothesis is that $p = \pi = (\pi_1, \ldots, \pi_{k+1})$. The
alternative hypothesis is unrestricted and includes all $p \ne \pi$ (with
$\sum_{j=1}^{k+1} p_j = 1$). The class of alternatives is too large for a UMP
test to exist, nor do unbiasedness or invariance considerations rescue the
problem. The usual Chi-squared test, which is based on the test statistic $Q_n$
given by

$$Q_n = \sum_{j=1}^{k+1} \frac{(Y_j - n\pi_j)^2}{n\pi_j} , \qquad (11.1)$$

will be seen to possess an asymptotic maximin property; see Section 14.3. ■
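
As a concrete illustration of the statistic $Q_n$ in (11.1), the following
Python sketch computes it from observed category counts and null
probabilities. It assumes the standard Pearson form, in which each squared
deviation $(Y_j - n\pi_j)^2$ is divided by the expected count $n\pi_j$; the
function name and the sample numbers are hypothetical and chosen only for the
example.

```python
def chi_squared_statistic(counts, null_probs):
    """Pearson's statistic: sum over categories of (Y_j - n*pi_j)^2 / (n*pi_j)."""
    if len(counts) != len(null_probs):
        raise ValueError("counts and null_probs must have the same length")
    n = sum(counts)  # total number of multinomial trials
    return sum(
        (y - n * p) ** 2 / (n * p)
        for y, p in zip(counts, null_probs)
    )


if __name__ == "__main__":
    # Hypothetical data: 80 trials, four equally likely categories under H.
    print(chi_squared_statistic([18, 22, 20, 20], [0.25, 0.25, 0.25, 0.25]))
```

Large values of the statistic indicate disagreement between the observed
counts and the null probabilities; how large is "large" is exactly what the
limiting distribution of $Q_n$ is used to decide.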


Example 11.1.4 (Nonparametric Mean) Suppose $X_1, \ldots, X_n$ are i.i.d. from
a distribution $F$ with finite mean $\mu$ and finite variance. The problem is
to test $\mu = 0$. Except when $F$ is assumed to belong to a number of simple
parametric families, optimal tests for the mean rarely exist. Moreover, if we
assume only a second moment, it is impossible to construct reasonable tests
that are of a given size (Theorem 11.4.6). But, by making a weak restriction on
the family, we will see that it is possible to construct tests that are
approximately level $\alpha$ and that in addition possess an asymptotic maximin
property; see Section 11.4. ■

In the remaining chapters, we shall consider hypothesis testing and estimation
by confidence sets from a large sample or asymptotic point of view. In this
approach, exact results are replaced by approximate ones that have the
advantage of both greater simplicity and generality. But the large sample
approach is not restricted to situations where no finite sample optimality
approach works. As the following example shows, limit theorems often provide an
easy way to approximate the critical value and power of a test (whether it has
any optimality properties or not).

Example 11.1.5 (Simple vs. Simple) Suppose that $X_1, \ldots, X_n$ are i.i.d.
with common distribution $P$. The problem is to test the simple null hypothesis
$P = P_0$ versus the simple alternative $P = P_1$. Let $p_i$ denote the density
of $P_i$ with respect to a measure $\mu$. By the Neyman-Pearson Lemma, the
optimal test rejects for large values of
$\sum_{i=1}^{n} \log[p_1(X_i)/p_0(X_i)]$. The exact null distribution of this
test statistic may be difficult to obtain since, in general, an $n$-fold
integration is required. On the other hand, since the statistic takes the
simple form of a sum of i.i.d. variables, large sample approximations to the
critical value and power are easily obtained from the Central Limit Theorem
(Theorem 11.2.4). ■
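
To make the approximation in Example 11.1.5 concrete, the sketch below treats
the log likelihood ratio statistic as approximately normal. It assumes the
user supplies the mean and standard deviation of a single term
$\log[p_1(X_i)/p_0(X_i)]$ under $P_0$ and under $P_1$; the function names are
hypothetical. The illustration uses $P_0 = N(0,1)$ and $P_1 = N(1,1)$, for
which each term equals $X_i - 1/2$, with mean $-1/2$ under $P_0$, mean $+1/2$
under $P_1$, and variance 1 under both.

```python
from math import sqrt
from statistics import NormalDist


def clt_critical_value(n, mean0, sd0, alpha):
    """Normal approximation to the level-alpha critical value of a sum of
    n i.i.d. terms with null mean mean0 and null standard deviation sd0."""
    z = NormalDist().inv_cdf(1 - alpha)
    return n * mean0 + sqrt(n) * sd0 * z


def clt_power(n, crit, mean1, sd1):
    """Normal approximation to P{sum > crit} when each term has mean mean1
    and standard deviation sd1 under the alternative."""
    return 1 - NormalDist().cdf((crit - n * mean1) / (sqrt(n) * sd1))


if __name__ == "__main__":
    n, alpha = 9, 0.05
    crit = clt_critical_value(n, mean0=-0.5, sd0=1.0, alpha=alpha)
    print(crit, clt_power(n, crit, mean1=0.5, sd1=1.0))
```

In this normal illustration the sum is itself exactly normal, so the
approximation happens to be exact; for other pairs $P_0$, $P_1$ it is only a
large sample approximation whose accuracy should be checked.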

Another application of the large sample approach (discussed in Section 11.3) 
is the study of the robustness of tests when the assumptions under which they 
are derived do not hold. Here, asymptotic considerations have been found to be 
indispensable. The problem is just too complicated for the more detailed small 
sample methods to provide an adequate picture. In general, two distinct types 
of robustness considerations arise, which may be termed robustness of validity 
and robustness of efficiency; this distinction has been pointed out by Tukey and
McLaughlin (1963), Box and Tiao (1964), and Mosteller and Tukey (1977). For
robustness of validity, the issue is whether a level $\alpha$ test retains its
level and power if the parameter space is enlarged to include a wider class of
distributions. For example, in testing whether the mean of a normal population
is zero, we may wish to consider the validity of a test without assuming
normality. However, even when a test possesses robustness of validity, are its
optimality properties preserved when the parameter space is enlarged? This
question is one of robustness
of efficiency (or inference robustness). In the context of the one-sample normal
location model, for example, one would study the behavior of procedures (such as 
a one-sample t-test) when the underlying distribution has thicker tails than the 
normal, or perhaps when the observations are not assumed independent. Large 
sample theory offers valuable insights into these issues, as will be seen in Section 
11.3. 

When finite and large sample optimal procedures do not exist for a given
problem, it becomes important to determine procedures which have at least
reasonable performance characteristics. Large sample considerations often lead
to suitable definitions and methods of construction. An example of this nature
that will be treated later is the problem of testing whether an i.i.d. sample is
uniformly distributed or, more generally, of goodness of fit.

As the starting point of a large sample theory of inference, we now define
asymptotic analogs of the concepts of size, level of significance, confidence
coefficient and confidence level. Suppose that data $X^{(n)}$ comes from a
model indexed by a parameter $\theta \in \Omega$. Typically, $X^{(n)}$ refers
to an i.i.d. sample of $n$ observations, and an asymptotic approach assumes
that $n \to \infty$. Of course, two-sample problems can be considered in this
setup, as well as more complex data structures. Nothing is assumed about the
family $\Omega$, so that the problem may be parametric or nonparametric. First,
consider testing a null hypothesis $H$ that $\theta \in \Omega_H$ versus the
alternative hypothesis $K$ that $\theta \in \Omega_K$, where $\Omega_H$ and
$\Omega_K$ are two mutually exclusive subsets of $\Omega$. We will be studying
sequences of tests $\phi_n(X^{(n)})$.

Definition 11.1.1 For a given level $\alpha$, a sequence of tests $\{\phi_n\}$
is pointwise asymptotically level $\alpha$ if, for any $\theta \in \Omega_H$,

$$\limsup_{n \to \infty} E_\theta[\phi_n(X^{(n)})] \le \alpha . \qquad (11.2)$$

Condition (11.2) guarantees that for any $\theta \in \Omega_H$ and any
$\epsilon > 0$, the level of the test will be less than or equal to
$\alpha + \epsilon$ when $n$ is sufficiently large. However, the condition does
not guarantee the existence of an $n_0$ (independent of $\theta$) such that

$$E_\theta[\phi_n(X^{(n)})] \le \alpha + \epsilon$$

for all $\theta \in \Omega_H$ and all $n \ge n_0$. We can therefore not
guarantee the behavior of the size

$$\sup_{\theta \in \Omega_H} E_\theta[\phi_n(X^{(n)})]$$

of the test, no matter how large $n$ is.

Example 11.1.6 (Uniform versus Pointwise Convergence) To illustrate the
above point, consider the function

$$f(n, \theta) = \alpha + (1 - \alpha) \exp(-n/\theta) ,$$

defined for positive integers $n$ and $\theta > 0$. Then, for any
$\theta > 0$, $f(n, \theta) \to \alpha$ as $n \to \infty$; that is,
$f(n, \theta)$ converges to $\alpha$ pointwise in $\theta$. However, this
convergence is not uniform in $\theta$ because

$$\sup_{\theta > 0} f(n, \theta) = \alpha + (1 - \alpha) \sup_{\theta > 0} \exp(-n/\theta) = 1 .$$

To cast this example in the context of hypothesis testing, assume
$X_1, \ldots, X_n$ are i.i.d. with the exponential distribution function

$$F_\theta(t) = P_\theta\{X_i \le t\} = 1 - \exp(-t/\theta) .$$

Define

$$\phi_n(X_1, \ldots, X_n) = \alpha + (1 - \alpha) I\{\min(X_1, \ldots, X_n) > 1\} .$$

Here and throughout, the notation $I\{E\}$ denotes an indicator random variable
that is 1 if the event $E$ occurs and is 0 otherwise. Then,
$E_\theta[\phi_n(X_1, \ldots, X_n)] = f(n, \theta)$. Hence, if $\Omega_H$ is
the positive real line, the test sequence $\phi_n$ satisfies (11.2), but its
size is 1 for every $n$. ■
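
The contrast between pointwise and uniform behavior in Example 11.1.6 is easy
to see numerically. The short Python sketch below evaluates the rejection
probability $\alpha + (1 - \alpha)\exp(-n/\theta)$ of the example; the function
name and the default $\alpha = 0.05$ are choices made only for this
illustration.

```python
from math import exp


def rejection_probability(n, theta, alpha=0.05):
    """f(n, theta) = alpha + (1 - alpha) * exp(-n / theta)."""
    return alpha + (1 - alpha) * exp(-n / theta)


if __name__ == "__main__":
    # Pointwise convergence: theta is held fixed while n grows.
    for n in (1, 10, 100, 1000):
        print("theta = 2, n =", n, rejection_probability(n, 2.0))
    # No uniformity: for every n, a large enough theta pushes f toward 1.
    for n in (1, 10, 100, 1000):
        print("theta = 1e6 * n, n =", n, rejection_probability(n, 1e6 * n))
```

For fixed $\theta$ the values settle at $\alpha$, while along
$\theta = 10^6 n$ they stay near 1 for every $n$, which is why the size of the
test never drops below 1.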

In order to guarantee the behavior of the limiting size of a test sequence, we 
require the following stronger condition. 

Definition 11.1.2 The sequence $\{\phi_n\}$ is uniformly asymptotically level
$\alpha$ if

$$\limsup_{n \to \infty} \sup_{\theta \in \Omega_H} E_\theta[\phi_n(X^{(n)})] \le \alpha . \qquad (11.3)$$

If instead of (11.3), the sequence $\{\phi_n\}$ satisfies

$$\lim_{n \to \infty} \sup_{\theta \in \Omega_H} E_\theta[\phi_n(X^{(n)})] = \alpha , \qquad (11.4)$$

then this value of $\alpha$ is called the limiting size of $\{\phi_n\}$.

Of course, we also will study the behavior of tests under the alternative
hypothesis. The following is a weak condition that we expect reasonable tests
to satisfy.

Definition 11.1.3 The sequence $\{\phi_n\}$ is pointwise consistent in power
if, for any $\theta$ in $\Omega_K$,

$$E_\theta[\phi_n(X^{(n)})] \to 1 \qquad (11.5)$$

as $n \to \infty$.

Example 11.1.7 (One-parameter families, Example 11.1.1, continued)
Let $T_n = T_n(X_1, \ldots, X_n)$ be a sequence of statistics, with
distributions depending on a real-valued parameter $\theta$. For testing
$H : \theta = \theta_0$ against $K : \theta > \theta_0$, consider the tests
$\phi_n$ that reject $H$ when $T_n > C_n$. In many applications, it will turn
out that, when $\theta = \theta_0$, $n^{1/2}(T_n - \theta_0)$ has a limiting
normal distribution with mean 0 and variance $\tau^2(\theta_0)$ in the sense
that, for any real number $t$,

$$P_{\theta_0}\{n^{1/2}(T_n - \theta_0) \le t\} \to \Phi(t/\tau(\theta_0)) , \qquad (11.6)$$

where $\Phi(\cdot)$ is the standard normal c.d.f. Let $z_\alpha$ satisfy
$\Phi(z_\alpha) = \alpha$. Then, the test with

$$C_n = \theta_0 + \frac{\tau(\theta_0)}{n^{1/2}} z_{1-\alpha}$$

has limiting size $\alpha$, since

$$P_{\theta_0}\Big\{T_n > \theta_0 + \frac{\tau(\theta_0)}{n^{1/2}} z_{1-\alpha}\Big\} \to \alpha .$$

Consider next the power of $\phi_n$ under the assumption that not only (11.6)
holds, but that it remains valid when $\theta_0$ is replaced by any
$\theta > \theta_0$. Then, the power of $\phi_n$ against $\theta$ is

$$\beta_n(\theta) = P_\theta\{n^{1/2}(T_n - \theta) > z_{1-\alpha}\tau(\theta_0) - n^{1/2}(\theta - \theta_0)\}$$

and hence $\beta_n(\theta) \to 1$ for any $\theta > \theta_0$, so that the test
sequence is pointwise consistent in power. ■
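
The power expression in Example 11.1.7 suggests a simple numerical
approximation. The Python sketch below assumes that $n^{1/2}(T_n - \theta)$ is
approximately $N(0, \tau^2(\theta))$ under $\theta$, so that
$\beta_n(\theta) \approx 1 - \Phi([z_{1-\alpha}\tau(\theta_0) -
n^{1/2}(\theta - \theta_0)]/\tau(\theta))$. The function name and its
arguments `tau0` and `tau` (standing for $\tau(\theta_0)$ and $\tau(\theta)$)
are hypothetical and introduced only for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n, theta, theta0, tau0, tau, alpha=0.05):
    """Normal approximation to the power of the test rejecting when
    T_n > theta0 + tau0 * z_{1-alpha} / sqrt(n), evaluated at theta."""
    z = NormalDist().inv_cdf(1 - alpha)
    shift = sqrt(n) * (theta - theta0)
    return 1 - NormalDist().cdf((z * tau0 - shift) / tau)


if __name__ == "__main__":
    # Illustration with tau(theta) = 1 for all theta (e.g. a normal mean).
    for n in (10, 100, 1000, 10000):
        print(n, approx_power(n, theta=0.1, theta0=0.0, tau0=1.0, tau=1.0))
```

For a fixed alternative the approximate power increases to 1 as $n$ grows,
which is the pointwise consistency noted in the example; at
$\theta = \theta_0$ the expression reduces to $\alpha$.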

Similar definitions apply to the construction of confidence sets. Let
$g = g(\theta)$ be the parameter function of interest, for some mapping $g$
from $\Omega$ to some space $\Omega_g$. Let
$S_n = S_n(X^{(n)}) \subset \Omega_g$ denote a sequence of confidence sets for
$g(\theta)$.

Definition 11.1.4 A sequence of confidence sets $S_n$ is pointwise
asymptotically level $1 - \alpha$ if, for any $\theta \in \Omega$,

$$\liminf_{n \to \infty} P_\theta\{g(\theta) \in S_n(X^{(n)})\} \ge 1 - \alpha . \qquad (11.7)$$

The sequence $\{S_n\}$ is uniformly asymptotically level $1 - \alpha$ if

$$\liminf_{n \to \infty} \inf_{\theta \in \Omega} P_\theta\{g(\theta) \in S_n(X^{(n)})\} \ge 1 - \alpha . \qquad (11.8)$$

If the liminf in the left hand side of (11.8) can be replaced by a lim, then the
left hand side is called the limiting confidence coefficient for $\{S_n\}$.

Most of the asymptotic theory we shall consider is local in a sense that we now
briefly describe. In the hypothesis testing context, any reasonable test
sequence $\phi_n$ is pointwise consistent in power. However, any actual
situation has finite sample size $n$ and its power against any fixed
alternative is typically less than one. In order to obtain a meaningful
assessment of power, one therefore considers sequences of alternatives
$\theta_n$ tending to $\Omega_H$ at a suitable rate, so that the limiting power
of $\phi_n$ against $\theta_n$ is less than one. (See Example 11.2.5 for a
simple example of such a local approach.)

An alternative to the local approach is to consider the rate at which the power 
tends to one against a fixed alternative. Although there exists a large literature 
on this approach based on large-deviation theory, the resulting approximations 
tend to be less accurate and we shall not treat this topic here. 

It is also important to mention that asymptotic results may provide poor
approximations to the actual finite sample setting. Furthermore, convergence to
a limit as $n \to \infty$ certainly does not guarantee that the approximation
will improve with increasing $n$; an example is provided by Hodges (1957). Any
asymptotic result should therefore be accompanied by an investigation of its
reliability for finite sample sizes. Such checks can be carried out by
simulation studies or higher order asymptotic analysis.

The concepts and definitions presented in this introduction will be explored 
more fully in the remaining chapters. First, we need techniques to be able to 
approximate significance levels, power functions, and confidence coefficients. To 
this end, the next section is devoted to useful results from the theory of weak 
convergence and other convergence concepts. 


11.2 Basic Convergence Concepts 

11.2.1 Weak Convergence and Central Limit Theorems 

In this section, the basic notation, definitions and results from the theory of weak 
convergence are introduced. The main theorems will be presented without proof, 
but we will provide illustrations of their use. For a more complete background, 
the reader is referred to Pollard (1984), Dudley (1989) or Billingsley (1995). 

Let $X$ denote a $k \times 1$ random vector (which is just a vector-valued
random variable), so that the $i$th component $X_i$ of $X$ is a real-valued
random variable. Then, $X^T = (X_1, \ldots, X_k)$. The (multivariate)
cumulative distribution function (c.d.f.) of $X$ is defined to be:

$$F_X(x_1, \ldots, x_k) = P\{X_1 \le x_1, \ldots, X_k \le x_k\} .$$

Here, the probability $P$ refers to the probability on whatever space $X$ is
defined. A point $x^T = (x_1, \ldots, x_k)$ at which the c.d.f. $F_X(\cdot)$ is
continuous is called a continuity point of $F_X$. Alternatively, $x$ is a
continuity point of $F_X$ if the boundary of the set of $(y_1, \ldots, y_k)$
such that $y_i \le x_i$ for all $i$ has probability 0 under the distribution of
$X$.$^{1}$ As an example, the multivariate normal distribution was first
studied in Section 3.9.2.

Definition 11.2.1 A sequence of random vectors $\{X_n\}$ with c.d.f.s
$\{F_{X_n}(\cdot)\}$ is said to converge in distribution (or in law) to a
random vector $X$ with c.d.f. $F_X(\cdot)$ if

$$F_{X_n}(x_1, \ldots, x_k) \to F_X(x_1, \ldots, x_k)$$

at all continuity points $(x_1, \ldots, x_k)$ of $F_X(\cdot)$. This convergence
will also be denoted $X_n \xrightarrow{d} X$. Because it really only has to do
with the laws of the random variables (and not with the random variables
themselves), we may also equivalently say $F_{X_n}$ converges weakly to $F_X$,
written $F_{X_n} \xrightarrow{d} F_X$.$^{2}$

The limiting random vector $X$ plays an auxiliary role, since any random
variable with the same distribution would serve the same purpose. Therefore,
the notation will sometimes be abused so that we also say $X_n$ converges in
distribution to the c.d.f. $F$, written $X_n \xrightarrow{d} F$.

There are many equivalent characterizations of weak convergence, some of 
which are recorded in the next theorem. 

Theorem 11.2.1 (Portmanteau Theorem) Suppose $X_n$ and $X$ are random
vectors in $\mathbb{R}^k$. The following are equivalent:

(i) $X_n \xrightarrow{d} X$.

(ii) $Ef(X_n) \to Ef(X)$ for all bounded, continuous real-valued functions $f$.

(iii) For any open set $O$ in $\mathbb{R}^k$,
$\liminf P(X_n \in O) \ge P(X \in O)$.

(iv) For any closed set $G$ in $\mathbb{R}^k$,
$\limsup P(X_n \in G) \le P(X \in G)$.

(v) For any set $E$ in $\mathbb{R}^k$ for which $\partial E$, the boundary of
$E$, satisfies $P(X \in \partial E) = 0$, $P(X_n \in E) \to P(X \in E)$.

(vi) $\liminf Ef(X_n) \ge Ef(X)$ for any nonnegative continuous $f$.


1 In general, the boundary of a set $E$ in $\mathbb{R}^k$, denoted
$\partial E$, is defined as follows. The closure of $E$, denoted $\bar{E}$, is
the set of $x \in \mathbb{R}^k$ for which there exists a sequence $x_n \in E$
with $x_n \to x$. The set $E$ is closed if $E = \bar{E}$. The interior of $E$,
denoted $E^\circ$, is the set of $x$ such that, for some $\epsilon > 0$, the
Euclidean ball with center $x$ and radius $\epsilon$, defined by
$\{y \in \mathbb{R}^k : |y - x| < \epsilon\}$, is contained in $E$. Here
$|\cdot|$ denotes the usual Euclidean norm. The set $E$ is open if
$E = E^\circ$. If $E^c$ denotes the complement of a set $E$, then evidently,
$E^\circ$ is the complement of the closure of $E^c$, and so $E$ is open if and
only if $E^c$ is closed. The boundary $\partial E$ of a set $E$ is then defined
to be $\bar{E} - E^\circ = \bar{E} \cap (E^\circ)^c$.

2 The term weak convergence (also sometimes called weak star convergence)
distinguishes this type of convergence from stronger convergence concepts to be
discussed later. However, the term is used because it is a special case of
convergence in the weak star topology for elements in a Banach space (such as
the space of signed measures on $\mathbb{R}^k$), though we will make no direct
use of any such topological notions.

Another equivalent characterization of weak convergence is based on the notion
of the characteristic function of a random vector.

Definition 11.2.2 The characteristic function of a random vector $X$ (taking
values in $\mathbb{R}^k$) is the function $\zeta_X(\cdot)$ from $\mathbb{R}^k$
to the complex plane given by

$$\zeta_X(t) = E(e^{i\langle t, X \rangle}) .$$

In the definition, $\langle t, X \rangle$ refers to the usual inner product, so
that $\langle t, X \rangle = \sum_{j=1}^{k} t_j X_j$. Two important properties
of characteristic functions are the following. First, the distribution of $X$
is uniquely determined by its characteristic function. Second, the
characteristic function of a sum of independent real-valued random variables is
the product of the individual characteristic functions (Problem 11.7).

Example 11.2.1 (Multivariate Normal Distribution) Suppose a random
vector $X^T = (X_1, \ldots, X_k)$ is $N(\mu, \Sigma)$, the multivariate normal distribution with
mean vector $\mu^T = (\mu_1, \ldots, \mu_k)$ and covariance matrix $\Sigma$. In the case $k = 1$, if $X$
is normally distributed with mean $\mu$ and variance $\sigma^2$, its characteristic function
is:

$E(e^{itX}) = \int_{-\infty}^{\infty} e^{itx} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \exp\left(it\mu - \tfrac{1}{2}\sigma^2 t^2\right)$, (11.9)

which can be verified by a simple integration (Problem 11.8). To obtain the
characteristic function for $k > 1$, note that

$\zeta_X(t) = E\left(e^{i\langle t, X\rangle}\right)$

is the characteristic function

$\zeta_{\langle t, X\rangle}(\lambda) = E\left(e^{i\lambda\langle t, X\rangle}\right)$

of $\langle t, X\rangle$ evaluated at $\lambda = 1$. Now if $X$ is multivariate normal $N(\mu, \Sigma)$, then $\langle t, X\rangle$
is univariate normal with mean $\langle t, \mu\rangle$ and variance $\langle \Sigma t, t\rangle = t^T \Sigma t$. Therefore, by
the case $k = 1$, we find that

$E\left(e^{i\langle t, X\rangle}\right) = \exp\left(i\langle t, \mu\rangle - \tfrac{1}{2}\langle \Sigma t, t\rangle\right)$. ■ (11.10)
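As a numerical companion to (11.9) and (11.10), the following minimal sketch compares a trapezoidal approximation of $E(e^{itX})$ with the closed form. The function names, the integration range of twelve standard deviations, and the step count are illustrative assumptions, not part of the text.

```python
import cmath
import math


def normal_cf_closed_form(t, mu, sigma):
    """Right-hand side of (11.9): exp(i*t*mu - sigma^2 * t^2 / 2)."""
    return cmath.exp(1j * t * mu - 0.5 * sigma ** 2 * t ** 2)


def normal_cf_quadrature(t, mu, sigma, half_width=12.0, steps=20000):
    """Approximate E[exp(itX)] by the trapezoidal rule on mu +/- half_width*sigma."""
    lo = mu - half_width * sigma
    hi = mu + half_width * sigma
    dx = (hi - lo) / steps
    total = 0j
    for j in range(steps + 1):
        x = lo + j * dx
        density = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        weight = 0.5 if j in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * cmath.exp(1j * t * x) * density
    return total * dx


def multivariate_normal_cf(t, mu, cov):
    """Formula (11.10) for plain Python lists: exp(i<t,mu> - <Sigma t, t>/2)."""
    k = len(t)
    inner = sum(t[j] * mu[j] for j in range(k))
    quad = sum(t[a] * cov[a][b] * t[b] for a in range(k) for b in range(k))
    return cmath.exp(1j * inner - 0.5 * quad)
```

With $k = 1$ the multivariate formula should reduce to the univariate one, which gives a quick consistency check of the reduction argument used in the example.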

Theorem 11.2.2 (Continuity Theorem) $X_n \xrightarrow{d} X$ in $\mathbb{R}^k$ if and only if

$\zeta_{X_n}(t) \to \zeta_X(t)$

for all $t$ in $\mathbb{R}^k$.

Note that it is not enough to assume $\zeta_{X_n}(t) \to \zeta(t)$ for some limit function
$\zeta(\cdot)$ in order to conclude $X_n \xrightarrow{d} X$; one must know that $\zeta(\cdot)$ is the characteristic
function of some random variable (or that $\zeta(\cdot)$ is continuous at 0) (Problem 11.9).

Weak convergence of random vectors on $\mathbb{R}^k$ can be reduced to studying weak
convergence on the real line by means of the following result, the proof of which
follows immediately from Theorem 11.2.2 (Problem 11.10).

Theorem 11.2.3 (Cramér-Wold Device) A sequence of random vectors $X_n$
on $\mathbb{R}^k$ satisfies $X_n \xrightarrow{d} X$ iff $\langle t, X_n\rangle \xrightarrow{d} \langle t, X\rangle$ for every $t \in \mathbb{R}^k$.

The following result is crucial for this and the following chapters.

Theorem 11.2.4 (Multivariate Central Limit Theorem) Let $X_n^T = (X_{n,1}, \ldots, X_{n,k})$
be a sequence of i.i.d. random vectors with mean vector $\mu^T = (\mu_1, \ldots, \mu_k)$
and covariance matrix $\Sigma$. Let $\bar{X}_{n,j} = n^{-1} \sum_{i=1}^n X_{i,j}$. Then

$\left(n^{1/2}(\bar{X}_{n,1} - \mu_1), \ldots, n^{1/2}(\bar{X}_{n,k} - \mu_k)\right)^T \xrightarrow{d} N(0, \Sigma)$.

To cover situations in which the distribution varies with sample size, we will
deal with a triangular array of variables $\{X_{n,i} : 1 \le i \le r_n,\ n = 1, 2, \ldots\}$, where
it is assumed $r_n \to \infty$ as $n \to \infty$. Typically, $r_n = n$, and so the term triangular
array is an appropriate description, but note that the term triangular array is
used even if $r_n \ne n$. The following limit theorem provides sufficient conditions
for asymptotic normality for a normalized sum of real-valued variables making
up a triangular array. (See Billingsley (1995), p. 369.)

Theorem 11.2.5 (Lindeberg Central Limit Theorem) Suppose, for each $n$,
$X_{n,1}, \ldots, X_{n,r_n}$ are independent real-valued random variables. Assume $E(X_{n,i}) = 0$
and $\sigma_{n,i}^2 = E(X_{n,i}^2) < \infty$. Let $s_n^2 = \sum_{i=1}^{r_n} \sigma_{n,i}^2$. Suppose, for each $\epsilon > 0$,

$\sum_{i=1}^{r_n} \frac{1}{s_n^2} E\left[X_{n,i}^2\, I\{|X_{n,i}| > \epsilon s_n\}\right] \to 0$ as $n \to \infty$. (11.11)

Then, $\sum_{i=1}^{r_n} X_{n,i}/s_n \xrightarrow{d} N(0, 1)$.

For most applications, Lindeberg's Condition (11.11) can be verified by
Lyapounov's Condition, which says that, for some $\delta > 0$, $|X_{n,i}|^{2+\delta}$ are integrable
and

$\lim_{n \to \infty} \sum_{i=1}^{r_n} \frac{1}{s_n^{2+\delta}} E\left[|X_{n,i}|^{2+\delta}\right] = 0$. (11.12)

Indeed, (11.12) implies (11.11) (Problem 11.11), and the result may be stated as
follows.

Corollary 11.2.1 (Lyapounov Central Limit Theorem) Suppose, for each
$n$, $X_{n,1}, \ldots, X_{n,r_n}$ are independent. Assume $E(X_{n,i}) = 0$ and $\sigma_{n,i}^2 = E(X_{n,i}^2) < \infty$.
Let $s_n^2 = \sum_{i=1}^{r_n} \sigma_{n,i}^2$. Suppose, for some $\delta > 0$, (11.12) holds. Then,

$\sum_{i=1}^{r_n} X_{n,i}/s_n \xrightarrow{d} N(0, 1)$.

There also exists a partial converse to Lindeberg's Central Limit Theorem, due
to Feller and Lévy. (See Billingsley (1995), p. 574.)

Theorem 11.2.6 Suppose, for each $n$, $X_{n,1}, \ldots, X_{n,r_n}$ are independent, mean
0, $\sigma_{n,i}^2 = E(X_{n,i}^2) < \infty$ and $s_n^2 = \sum_{i=1}^{r_n} \sigma_{n,i}^2$. Also, assume the array is uniformly
asymptotically negligible; that is,

$\max_{1 \le i \le r_n} P\{|X_{n,i}/s_n| > \epsilon\} \to 0$ (11.13)

for any $\epsilon > 0$. If $\sum_{i=1}^{r_n} X_{n,i}/s_n \xrightarrow{d} N(0, 1)$, then the Lindeberg Condition (11.11)
is satisfied.





Corollary 11.2.2 Suppose, for each $n$, $X_{n,1}, \ldots, X_{n,n}$ are i.i.d. with mean 0
and variance $\sigma_n^2$. Let $s_n^2 = n\sigma_n^2$. Assume $\sum_{i=1}^n X_{n,i}/s_n \xrightarrow{d} N(0, 1)$. Then, the
Lindeberg Condition (11.11) is satisfied.

Corollary 11.2.2 follows from Theorem 11.2.6 because the assumption that
the $n$th row of the triangular array is i.i.d. implies the array is uniformly
asymptotically negligible, so that the condition (11.13) holds. Indeed, by
Chebyshev's inequality,

$P\{|X_{n,i}|/s_n > \epsilon\} \le \frac{E(X_{n,i}^2)}{s_n^2 \epsilon^2} = \frac{1}{n\epsilon^2} \to 0$.

The following Berry-Esseen Theorem gives information on the error in the
normal approximation provided by the Central Limit Theorem.

Theorem 11.2.7 Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued random variables
with c.d.f. $F$. Let $\mu(F)$ denote the mean of $F$ and let $\sigma^2(F)$ denote the variance
of $F$, assumed finite and nonzero. Let $S_n = \sum_{i=1}^n X_i$. Then, there exists a
universal constant $C$ (not depending on $F$, $n$, or $x$) such that

$\left| P\left\{ \frac{S_n - n\mu(F)}{n^{1/2}\sigma(F)} \le x \right\} - \Phi(x) \right| \le \frac{C}{n^{1/2}} \cdot \frac{E_F\left[|X_1 - \mu(F)|^3\right]}{\sigma(F)^3}$, (11.14)

where $\Phi(\cdot)$ denotes the standard normal c.d.f.

The Berry-Esseen Theorem holds if $C = 0.7975$. The smallest value of $C$ for
which the result holds is unknown, but it is known that it fails for $C < 0.4097$
(van Beek (1972)).
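To make the bound (11.14) concrete, the sketch below computes, for Bernoulli summands, the exact Kolmogorov distance between the standardized binomial c.d.f. and $\Phi$, and compares it with the right side of (11.14) using $C = 0.7975$. The function names are hypothetical, and the choice of Bernoulli summands is an illustrative assumption.

```python
import math
from statistics import NormalDist

BERRY_ESSEEN_C = 0.7975  # constant quoted in the text


def binomial_pmf(n, p):
    """Exact Binomial(n, p) probabilities, computed in log space for stability."""
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(n + 1)
    ]
    return [math.exp(v) for v in logs]


def kolmogorov_distance_to_normal(n, p):
    """sup_x |P{(S_n - np)/sqrt(np(1-p)) <= x} - Phi(x)| for S_n ~ Binomial(n, p)."""
    phi = NormalDist().cdf
    sd = math.sqrt(n * p * (1 - p))
    cdf_left = 0.0  # value of the binomial c.d.f. just before the jump at k
    worst = 0.0
    for k, mass in enumerate(binomial_pmf(n, p)):
        target = phi((k - n * p) / sd)
        cdf_right = cdf_left + mass
        # The c.d.f. is a step function, so the supremum is approached at a jump,
        # either from the left or from the right.
        worst = max(worst, abs(cdf_left - target), abs(cdf_right - target))
        cdf_left = cdf_right
    return worst


def berry_esseen_bound(n, p):
    """Right side of (11.14) for Bernoulli(p) summands."""
    third_abs_moment = p * (1 - p) * ((1 - p) ** 2 + p ** 2)
    sigma_cubed = (p * (1 - p)) ** 1.5
    return BERRY_ESSEEN_C * third_abs_moment / (math.sqrt(n) * sigma_cubed)
```

For example, `kolmogorov_distance_to_normal(100, 0.1)` can be compared with `berry_esseen_bound(100, 0.1)`; the bound deteriorates as $p$ approaches 0 or 1, since the standardized third moment is then large.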

If $F$ is a fixed distribution with finite third moment and nonzero variance, the
right side of (11.14) tends to zero and hence the left side of (11.14) tends to zero
uniformly in $x$. Furthermore, if $\mathbf{F}$ is the family of distributions $F$ with

$\frac{E_F\left[|X - \mu(F)|^3\right]}{\sigma^3(F)} < B$, (11.15)

for some fixed $B < \infty$, then this convergence is also uniform in $F$ as $F$ varies in
$\mathbf{F}$. Thus, if $S_n$ is the sum of $n$ i.i.d. variables with distribution $F_n$ in $\mathbf{F}$, then

$\sup_x \left| P\left\{ \frac{S_n - n\mu(F_n)}{n^{1/2}\sigma(F_n)} \le x \right\} - \Phi(x) \right| \to 0$. (11.16)


Example 11.2.2 Suppose $X_1, \ldots, X_n$ are i.i.d. Bernoulli trials with probability
of success $p$. Then, $S_n = \sum_{i=1}^n X_i$ is binomial based on $n$ trials and success
probability $p$, and the usual Central Limit Theorem asserts that the probability that
$(S_n - np)/[np(1-p)]^{1/2}$ is less than or equal to $x$ converges to $\Phi(x)$, if $p$ is not zero or
one. It follows from the Berry-Esseen theorem that this convergence is uniform
in both $x$ and $p$ as long as $p \in [\epsilon, 1 - \epsilon]$ for some $\epsilon > 0$. To see why, we show that
condition (11.15) is satisfied. Observe that

$E\left[|X_i - p|^3\right] = p(1-p)\left[(1-p)^2 + p^2\right] \le p(1-p)$.

Thus,

$E\left[|X_i - p|^3\right] / [p(1-p)]^{3/2} \le [\epsilon(1-\epsilon)]^{-1/2}$,

so that (11.15) holds with $B^{-2} = \epsilon(1-\epsilon)$. Thus, (11.16) holds, so that if $S_n$ is
binomial based on $n$ trials and success probability $p_n \to p \in (0, 1)$, then

$P\left\{ \frac{S_n - np_n}{[np_n(1-p_n)]^{1/2}} \le x_n \right\} \to \Phi(x)$ (11.17)

whenever $x_n \to x$.

Example 11.2.3 (The Sample Median) As an application of the Berry-Esseen
theorem and the previous example, the following result establishes the
asymptotic normality of the sample median. Given a sample $X_1, \ldots, X_n$ with
order statistics $X_{(1)} \le \cdots \le X_{(n)}$, the median $\tilde{X}_n$ is defined to be the middle
order statistic $X_{(k)}$ if $n = 2k - 1$ is odd and the average of $X_{(k)}$ and $X_{(k+1)}$ if
$n = 2k$ is even.

Theorem 11.2.8 Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued random variables
with c.d.f. $F$. Assume $F(\theta) = 1/2$, and that $F$ is differentiable at $\theta$ with $F' = f$
and $f(\theta) > 0$. Let $\tilde{X}_n$ denote the sample median. Then

$n^{1/2}(\tilde{X}_n - \theta) \xrightarrow{d} N\left(0, \frac{1}{4f^2(\theta)}\right)$.


Proof. Assume first that $n$ tends to $\infty$ through odd values and, without loss of
generality, that $\theta = 0$. Fix any real number $a$ and let $S_n$ be the number of $X_i$
that exceed $a/n^{1/2}$. Then the event $\{\tilde{X}_n \le a/n^{1/2}\}$ is equivalent to the event
$\{S_n \le (n-1)/2\}$. But, $S_n$ is binomial with parameters $n$ and success probability
$p_n = 1 - F(a/n^{1/2})$. Thus,

$P\{n^{1/2}\tilde{X}_n \le a\} = P\left\{S_n \le \frac{n-1}{2}\right\} = P\left\{ \frac{S_n - np_n}{[np_n(1-p_n)]^{1/2}} \le x_n \right\}$,

where

$x_n = \frac{\frac{1}{2}(n-1) - np_n}{[np_n(1-p_n)]^{1/2}} = \frac{n^{1/2}(\frac{1}{2} - p_n) - 1/(2n^{1/2})}{[p_n(1-p_n)]^{1/2}}$.

As $n \to \infty$, $p_n \to 1/2$ and

$n^{1/2}\left(\tfrac{1}{2} - p_n\right) = a \cdot \frac{F(a/n^{1/2}) - F(0)}{a/n^{1/2}} \to a f(0)$,

which implies $x_n \to 2af(0)$. Therefore, by (11.17),

$P\{n^{1/2}\tilde{X}_n \le a\} \to \Phi[2f(0)a]$,

which completes the proof for odd $n$. For the case of even $n$, see Problem 11.15. ■
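The binomial identity at the heart of the proof can be evaluated exactly, which gives a direct numerical check of the limit $\Phi[2f(0)a]$. The sketch below is a minimal illustration for odd $n$ and $\theta = 0$; the function names are hypothetical, and the standard normal choice of $F$ in the usage note is an assumption made for illustration only.

```python
import math
from statistics import NormalDist


def binomial_cdf(m, n, p):
    """P{S <= m} for S ~ Binomial(n, p), with terms computed in log space."""
    total = 0.0
    for k in range(0, m + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_term)
    return total


def median_cdf_exact(a, n, cdf):
    """P{n^(1/2) * median <= a} for odd n, via the binomial identity in the proof."""
    if n % 2 == 0:
        raise ValueError("this sketch covers odd n only")
    p_n = 1.0 - cdf(a / math.sqrt(n))  # chance that one observation exceeds a/sqrt(n)
    return binomial_cdf((n - 1) // 2, n, p_n)


def median_cdf_limit(a, density_at_median):
    """Limiting value Phi(2 f(0) a) from Theorem 11.2.8 with theta = 0."""
    return NormalDist().cdf(2.0 * density_at_median * a)
```

For a standard normal sample one might compare `median_cdf_exact(0.5, 1001, NormalDist().cdf)` with `median_cdf_limit(0.5, NormalDist().pdf(0.0))`; the two should be close for large odd $n$.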
Another result concerning uniformity in weak convergence is the following
theorem of Pólya.

Theorem 11.2.9 (Pólya's Theorem) Suppose $X_n \xrightarrow{d} X$ and $X$ has a continuous
c.d.f. $F_X$. Let $F_{X_n}$ denote the c.d.f. of $X_n$. Then, $F_{X_n}(x)$ converges to $F_X(x)$,
uniformly in $x$.

It is interesting and important to know that weak convergence of $F_n$ to $F$ can be
expressed in terms of $\rho(F_n, F)$, where $\rho$ is a metric on the space of distributions.
(Some basic properties of metrics are reviewed in the appendix, Section A.2.) To
be specific, on the real line, define the Lévy distance between distributions $F$ and
$G$ as follows.

Definition 11.2.3 Let $F$ and $G$ be distribution functions on the real line. The
Lévy distance between $F$ and $G$, denoted $\rho_L(F, G)$, is defined by

$\rho_L(F, G) = \inf\{\epsilon > 0 : F(x - \epsilon) - \epsilon \le G(x) \le F(x + \epsilon) + \epsilon \text{ for all } x\}$.

The definition implies that $\rho_L(F, G) = \rho_L(G, F)$ and that $\rho_L$ is a metric
on the space of distribution functions (Problem 11.20). Moreover, if $F_n$ and $F$
are distribution functions, then weak convergence of $F_n$ to $F$ is equivalent to
$\rho_L(F_n, F) \to 0$ (Problem 11.22). In this sense, $\rho_L$ metrizes weak convergence.

We shall next consider the implication of weak convergence for the convergence
of quantiles. Ideally, the $(1 - \alpha)$ quantile $x_{1-\alpha}$ of a distribution $F$ is defined by

$F(x_{1-\alpha}) = 1 - \alpha$. (11.18)

For the solutions of (11.18), it is necessary to distinguish three cases. First, if $F$
is continuous and strictly increasing, the equation (11.18) has a unique solution.
Second, if $F$ is not strictly increasing, it may happen that $F(x) = 1 - \alpha$ on an
interval $[a, b)$ or $[a, b]$, so that any $x$ in such an interval could serve as a $1 - \alpha$
quantile. Then, we shall define the $1 - \alpha$ quantile as the left hand endpoint of
the interval. Third, if $F$ has discontinuities, then (11.18) may have no solutions.
This happens if $F(x) > 1 - \alpha$ and $\sup\{F(y) : y < x\} < 1 - \alpha$, but in this case
we would call $x$ the $1 - \alpha$ quantile of $F$. A general definition encompassing all
these possibilities is given by

$x_{1-\alpha} = \inf\{x : F(x) \ge 1 - \alpha\}$. (11.19)

This is also sometimes written as $x_{1-\alpha} = F^{-1}(1 - \alpha)$ although $F$ may not have
a proper inverse function.
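The general definition (11.19) is easy to evaluate for a distribution with finitely many atoms, where the second and third cases above both occur. The following is a minimal sketch; the function name, the list-based representation of the distribution, and the rounding tolerance are illustrative assumptions.

```python
def quantile(support, probs, level):
    """Return inf{x : F(x) >= level} for a distribution on finitely many points.

    support: increasing list of atoms; probs: their probabilities (summing to 1).
    """
    if not 0.0 < level <= 1.0:
        raise ValueError("level must lie in (0, 1]")
    cumulative = 0.0
    for x, mass in zip(support, probs):
        cumulative += mass
        # Small tolerance so that rounding in the running sum does not skip an atom.
        if cumulative >= level - 1e-12:
            return x
    return support[-1]
```

For a fair Bernoulli variable, $F(x) = 1/2$ on $[0, 1)$, so the level $1/2$ falls on a flat stretch and the left endpoint is returned, while the level $0.6$ is jumped over at $x = 1$ and that jump point is returned.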

Weak convergence of $F_n$ to $F$ is not enough to guarantee that $F_n^{-1}(1 - \alpha)$
converges to $F^{-1}(1 - \alpha)$, but the following result shows this is true if $F$ is continuous
and strictly increasing at $F^{-1}(1 - \alpha)$.

Lemma 11.2.1 (i) Let $\{F_n\}$ be a sequence of distribution functions on the real
line converging weakly to a distribution function $F$. Assume $F$ is continuous and
strictly increasing at $y = F^{-1}(1 - \alpha)$. Then,

$F_n^{-1}(1 - \alpha) \to F^{-1}(1 - \alpha)$.

(ii) More generally, suppose $\{\hat{F}_n\}$ is a sequence of random distribution functions
satisfying $\hat{F}_n(x) \xrightarrow{P} F(x)$ at all $x$ which are continuity points of some fixed
distribution function $F$. Assume $F$ is continuous and strictly increasing at $F^{-1}(1 - \alpha)$.
Then,

$\hat{F}_n^{-1}(1 - \alpha) \xrightarrow{P} F^{-1}(1 - \alpha)$.

Proof. To prove (i), fix $\delta > 0$. Let $y - \epsilon$ and $y + \epsilon$ be continuity points of $F$ for
some $0 < \epsilon < \delta$. Then,

$F_n(y - \epsilon) \to F(y - \epsilon) < 1 - \alpha$

and

$F_n(y + \epsilon) \to F(y + \epsilon) > 1 - \alpha$.

Hence, for all sufficiently large $n$,

$y - \epsilon \le F_n^{-1}(1 - \alpha) \le y + \epsilon$,

and so, $|F_n^{-1}(1 - \alpha) - y| < \delta$ for all sufficiently large $n$. Since $\delta$ was arbitrary, the
result (i) is proved. The proof of (ii) is similar. ■


11.2.2 Convergence in Probability and Applications 

As pointed out earlier, convergence in law of $X_n$ to $X$ asserts only that the
distribution of $X_n$ tends to that of $X$, but says nothing about $X_n$ itself becoming
close to $X$. The following stronger form of convergence provides that $X_n$ and $X$
themselves are close for large $n$.

Definition 11.2.4 A sequence of random vectors $\{X_n\}$ converges in probability
to $X$, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$,

$P\{|X_n - X| > \epsilon\} \to 0$ as $n \to \infty$.

Convergence in probability implies convergence in distribution (Problem 11.30);
the converse is false in general. However, if $X_n$ converges in distribution to a
distribution assigning probability one to a constant vector $c$, then $X_n$ converges in
probability to $c$, and conversely. Note that, unlike weak convergence, $X_n$ and $X$
must be defined on the same probability space in order for Definition 11.2.4 to
make sense.

Convergence in probability of a sequence of random vectors $X_n$ is equivalent
to convergence in probability of their components. That is, if $X_n = (X_{n,1}, \ldots, X_{n,k})^T$
and $X = (X_1, \ldots, X_k)^T$, then $X_n \xrightarrow{P} X$ iff for each $i = 1, \ldots, k$,
$X_{n,i} \xrightarrow{P} X_i$. Moreover, $X_n \xrightarrow{P} 0$ if and only if $|X_n| \xrightarrow{P} 0$ (Problem 11.31).

A sequence of real-valued random variables $X_n$ converges in probability to
infinity, written $X_n \xrightarrow{P} \infty$ if, for any real number $B$,

$P\{X_n < B\} \to 0$

as $n \to \infty$.

The next result and the later Theorem 11.2.16 deal with the convergence of
the average of i.i.d. random variables toward their expectation, and are known
as the weak and strong laws of large numbers. The terminology reflects the fact
that the strong law asserts a stronger conclusion than the weak law.

Theorem 11.2.10 (Weak Law of Large Numbers) Let $X_i$ be i.i.d. real-valued
random variables with mean $\mu$. Then,

$\bar{X}_n = n^{-1} \sum_{i=1}^n X_i \xrightarrow{P} \mu$.

Note that it is possible for $\bar{X}_n$ to converge in probability to a constant even
if the mean does not exist (Problem 11.28). Also, if the $X_i$ are nonnegative and
the mean is not finite, then $\bar{X}_n \xrightarrow{P} \infty$ (Problem 11.32).

Suppose $X_1, \ldots, X_n$ are i.i.d. according to a model $\{P_\theta, \theta \in \Omega\}$. A sequence
of estimators $T_n = T_n(X_1, \ldots, X_n)$ is said to be a weakly consistent (or just
consistent) estimator sequence of $g(\theta)$ if, for each $\theta \in \Omega$,

$T_n \xrightarrow{P} g(\theta)$.

Thus, the consistency of an estimator sequence merely asserts convergence in
probability for each value of the parameter. For example, the Weak Law of Large
Numbers asserts that the sample mean is a consistent estimator of the population
mean whenever the population mean exists.


Example 11.2.4 Suppose $X_1, \ldots, X_n$ are i.i.d. according to either $P_0$ or $P_1$. If
$p_i$ denotes the density of $P_i$ with respect to a dominating measure, then by the
Neyman-Pearson Lemma, an optimal test rejects for large values of

$T_n = \frac{1}{n} \sum_{i=1}^n \log\left[p_1(X_i)/p_0(X_i)\right]$.

By the Weak Law of Large Numbers, under $P_0$,

$T_n \xrightarrow{P} -K(P_0, P_1)$, (11.20)

where $K(P_0, P_1)$ is the so-called Kullback-Leibler Information, defined as

$K(P_0, P_1) = -E_{P_0}\left[\log(p_1(X_1)/p_0(X_1))\right]$. (11.21)

The convergence (11.20) assumes $K(P_0, P_1)$ is well-defined in the sense that the
expectation in (11.21) exists. But, by Jensen's inequality (since the negative log
is convex),

$K(P_0, P_1) \ge -\log\left[E_{P_0}(p_1(X_1)/p_0(X_1))\right] \ge 0$.

If $P_0$ and $P_1$ are distinct, then the first inequality is strict, so that $K(P_0, P_1) \ge 0$
with equality iff $P_0 = P_1$. Note, however, that $K(P_0, P_1)$ may be $\infty$, but even in
this case, the convergence (11.20) holds; see Problem 11.33. Similarly, under the
alternative hypothesis $P_1$,

$T_n \xrightarrow{P} E_{P_1}\left[\log(p_1(X_1)/p_0(X_1))\right] = K(P_1, P_0) \ge 0$.

Note that $K(P_0, P_1)$ need not equal $K(P_1, P_0)$.

In summary, $T_n$ converges in probability, under $P_0$, to a negative constant
(possibly $-\infty$), while, under $P_1$, $T_n$ converges in probability to a positive constant
(assuming $P_0$ and $P_1$ are distinct). Therefore, for testing $P_0$ versus $P_1$, the test
that rejects when $T_n > 0$ is asymptotically perfect in the sense that both error
probabilities tend to zero; that is, $P_0\{T_n > 0\} \to 0$ and $P_1\{T_n \le 0\} \to 0$. It also
follows that, for fixed $\alpha \in (0, 1)$, if $\phi_n$ is a most powerful level $\alpha$ test sequence
for testing $P_0$ versus $P_1$ based on $n$ i.i.d. observations, then the power of $\phi_n$
against $P_1$ tends to one. Thus, if $P_0$ and $P_1$ are fixed with $n \to \infty$, the problem
is degenerate from an asymptotic point of view. ■
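For a concrete instance of (11.21), the Kullback-Leibler Information between two Bernoulli distributions can be computed in closed form. The sketch below is a hedged illustration; the function names and the Bernoulli setting are assumptions chosen for simplicity, not part of the example above.

```python
import math


def kl_bernoulli(p0, p1):
    """K(P0, P1) = -E_{P0}[log(p1(X)/p0(X))] for Bernoulli(p0) versus Bernoulli(p1)."""
    total = 0.0
    for prob0, prob1 in ((p0, p1), (1 - p0, 1 - p1)):
        if prob0 == 0.0:
            continue  # outcomes with P0-probability zero contribute nothing
        if prob1 == 0.0:
            return math.inf  # P0 charges an outcome that P1 rules out
        total -= prob0 * math.log(prob1 / prob0)
    return total


def limits_of_T_n(p0, p1):
    """Probability limits of T_n under P0 and under P1: (-K(P0,P1), K(P1,P0))."""
    return -kl_bernoulli(p0, p1), kl_bernoulli(p1, p0)
```

Evaluating `kl_bernoulli(0.2, 0.5)` and `kl_bernoulli(0.5, 0.2)` shows the asymmetry noted in the example, and `kl_bernoulli(0.5, 1.0)` illustrates a case where the information is infinite.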
For convergence in probability to a constant, it is not necessary for the $X_n$
to be defined on the same probability space. Suppose $P_n$ is a probability on a
probability space $(\Omega_n, \mathcal{F}_n)$, and let $X_n$ be a random vector from $\Omega_n$ to $\mathbb{R}^k$.
Then, if $c$ is a fixed constant vector in $\mathbb{R}^k$, we say that $X_n$ converges to $c$ in
$P_n$-probability if, for every $\epsilon > 0$,

$P_n\{|X_n - c| > \epsilon\} \to 0$ as $n \to \infty$.

Alternatively, we may say $X_n$ converges to $c$ in probability if it is understood
that the law of $X_n$ is determined by $P_n$.

For a sequence of numbers $x_n$ and $y_n$, the notation $x_n = o(y_n)$ means $x_n/y_n \to 0$
as $n \to \infty$. For random variables $X_n$ and $Y_n$, the notation $X_n = o_P(Y_n)$ means
$X_n/Y_n \xrightarrow{P} 0$. Similarly, $X_n = o_{P_n}(Y_n)$ means $X_n/Y_n \to 0$ in $P_n$-probability.

The following theorem is very useful for proving limit theorems. 

Theorem 11.2.11 (Slutsky's Theorem) Suppose $\{X_n\}$ is a sequence of real-valued
random variables such that $X_n \xrightarrow{d} X$. Further, suppose $\{A_n\}$ and $\{B_n\}$
satisfy $A_n \xrightarrow{P} a$, and $B_n \xrightarrow{P} b$, where $a$ and $b$ are constants. Then,
$A_n X_n + B_n \xrightarrow{d} aX + b$.

The conclusion in Slutsky's Theorem may be strengthened to convergence in
probability if it is assumed that $X_n \xrightarrow{P} X$. The following corollary to Slutsky's
Theorem is also fundamental.

Corollary 11.2.3 Suppose $\{X_n\}$ is a sequence of real-valued random variables
such that $X_n$ tends to $X$ in distribution, where $X$ has a continuous cumulative
distribution function $F$. If $C_n \to c$ in probability, where $c$ is a constant, then

$P\{X_n \le C_n\} \to F(c)$.

Corollary 11.2.3 is useful even when $C_n$ are nonrandom constants tending to
$c$. Also, the corollary holds even if $c = \infty$ or $c = -\infty$ (Problem 11.36), with the
interpretation $F(\infty) = 1$ and $F(-\infty) = 0$.

Note that Slutsky's theorem holds more generally if the convergence in
probability assumptions are replaced by convergence in $P_n$-probability.

Example 11.2.5 (Local Power Calculation) Suppose $S_n$ is binomial based
on $n$ trials and success probability $p$. Consider testing $p = 1/2$ versus $p > 1/2$.
The uniformly most powerful test rejects for large values of $S_n$. By Example
11.2.2,

$Z_n = (S_n - n/2)/(n/4)^{1/2} \xrightarrow{d} N(0, 1)$,

and so the test that rejects the null hypothesis when this quantity exceeds the
normal critical value $z_{1-\alpha}$ is asymptotically level $\alpha$. Let $\beta_n(p)$ denote the power
of this test against a fixed alternative $p > 1/2$. Then, $(S_n - np)/[np(1-p)]^{1/2}$ is
asymptotically standard normal if $p$ is the true value. Hence,

$\beta_n(p) = P_p\{Z_n > z_{1-\alpha}\} = P_p\left\{ \frac{S_n - np}{[np(1-p)]^{1/2}} > d_n(p) \right\}$,

where

$d_n(p) = \frac{z_{1-\alpha}}{[4p(1-p)]^{1/2}} + \frac{n^{1/2}(\frac{1}{2} - p)}{[p(1-p)]^{1/2}} \to -\infty$

if $p > 1/2$. Thus, $\beta_n(p) \to 1$ as $n \to \infty$ for any $p > 1/2$, and so the test sequence
is pointwise consistent.

This result does not distinguish between alternative values of $p$. Better
discrimination is obtained by considering alternatives for which the power tends
to a value less than 1. This is achieved by replacing a fixed alternative $p$ by
a sequence $p_n$ tending to 1/2, so that the task of distinguishing between 1/2
and $p_n$ becomes more difficult as information accumulates with increasing $n$. It
turns out that the power will tend to a limit less than one but greater than $\alpha$ if
$p_n = 1/2 + hn^{-1/2}$ with $h > 0$. To see this, note that, by Example 11.2.2, under $p_n$,
$(S_n - np_n)/[np_n(1-p_n)]^{1/2}$ is asymptotically standard normal. Then,

$\beta_n(p_n) = P_{p_n}\{Z_n > z_{1-\alpha}\} = P_{p_n}\left\{ \frac{S_n - np_n}{[np_n(1-p_n)]^{1/2}} > d_n(p_n) \right\}$.

But, $d_n(p_n) \to z_{1-\alpha} - 2h$. Hence, if $Z$ denotes a standard normal variable,

$\beta_n(p_n) \to P\{Z > z_{1-\alpha} - 2h\} = 1 - \Phi(z_{1-\alpha} - 2h)$.

Also, note that $\beta_n(p_n) \to 1$ if $n^{1/2}(p_n - 1/2) \to \infty$ and $\beta_n(p_n) \to \alpha$ if
$n^{1/2}(p_n - 1/2) \to 0$ (Problem 11.37). ■
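The local power limit $1 - \Phi(z_{1-\alpha} - 2h)$ can be compared with the exact binomial power of the test that rejects when $Z_n > z_{1-\alpha}$. The following minimal sketch does this; the function names are hypothetical, and the particular values of $n$, $h$ and $\alpha$ in the usage note are illustrative assumptions.

```python
import math
from statistics import NormalDist


def binomial_upper_tail(m, n, p):
    """P{S > m} for S ~ Binomial(n, p), with terms computed in log space."""
    total = 0.0
    for k in range(m + 1, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_term)
    return total


def exact_power(n, p, alpha):
    """Power at p of the test rejecting when (S_n - n/2)/(n/4)^(1/2) > z_{1-alpha}."""
    z = NormalDist().inv_cdf(1 - alpha)
    cutoff = n / 2 + z * math.sqrt(n) / 2  # reject when S_n > cutoff
    # S_n is integer valued, so S_n > cutoff is the same event as S_n > floor(cutoff).
    return binomial_upper_tail(math.floor(cutoff), n, p)


def limiting_local_power(h, alpha):
    """1 - Phi(z_{1-alpha} - 2h), the limit under p_n = 1/2 + h * n^(-1/2)."""
    z = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z - 2 * h)
```

For instance, with $n = 10000$ and $h = 1$, so that $p_n = 0.51$, `exact_power(10000, 0.51, 0.05)` can be set against `limiting_local_power(1, 0.05)`; taking $h = 0$ recovers the level $\alpha$ in the limit.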


The following is another useful result concerning convergence in probability.

Theorem 11.2.12 Suppose $X_n$ and $X$ are random vectors in $\mathbb{R}^k$ with $X_n \xrightarrow{P} X$.
Let $g$ be a continuous function from $\mathbb{R}^k$ to $\mathbb{R}^s$. Then, $g(X_n) \xrightarrow{P} g(X)$.


Example 11.2.6 (Sample Standard Deviation) Let $X_1, \ldots, X_n$ be i.i.d. real-valued
random variables with common mean $\mu$ and finite variance $\sigma^2$. The usual
unbiased sample variance estimator is given by

$S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$, (11.22)

where $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$ is the sample mean. By the weak law of large numbers,
$\bar{X}_n \to \mu$ in probability and $n^{-1} \sum_{i=1}^n X_i^2 \to E(X_i^2) = \mu^2 + \sigma^2$ in probability.
Hence,

$\frac{n-1}{n} S_n^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2 \to \sigma^2$

in probability, by Slutsky's Theorem. Thus, $S_n^2 \to \sigma^2$ in probability, which implies
$S_n \to \sigma$ in probability, by Theorem 11.2.12. ■


Example 11.2.7 (Confidence Intervals for a Binomial p) Suppose $S_n$ is
binomial based on $n$ trials and unknown success probability $p$. Let $\hat{p}_n = S_n/n$.
By Example 11.2.2, for any $p \in (0, 1)$, $n^{1/2}(\hat{p}_n - p)$ converges in distribution to
$N(0, p(1-p))$. This implies $\hat{p}_n \xrightarrow{P} p$ and so

$[\hat{p}_n(1 - \hat{p}_n)]^{1/2} \xrightarrow{P} [p(1-p)]^{1/2}$

as well. Therefore, by Slutsky's Theorem, for any $p \in (0, 1)$,

$\frac{n^{1/2}(\hat{p}_n - p)}{[\hat{p}_n(1 - \hat{p}_n)]^{1/2}} \xrightarrow{d} N(0, 1)$.

This implies that the confidence interval

$\hat{p}_n \pm z_{1-\alpha/2} \left[\frac{\hat{p}_n(1 - \hat{p}_n)}{n}\right]^{1/2}$ (11.23)

is pointwise consistent in level, for any fixed $p$ in $(0, 1)$, where $z_\beta$ is the $\beta$ quantile
of $N(0, 1)$. Note, however, that this confidence interval is not uniformly consistent
in level; in fact, for any $n$, the coverage probability can be arbitrarily close to 0
(Problem 11.38).

Unfortunately, an accumulating literature has shown that the coverage of the
interval in (11.23) is quite unreliable even for large values of $n$ or $np(1-p)$, and
varies quite erratically as the sample size increases. To cite just one example, the
probability of the interval (11.23) covering the true $p$ when $p = .2$ and $1 - \alpha = .95$
is .946 when $n = 30$, and it is .928 when $n = 98$. This example is taken from
Table 1 of Brown, Cai and DasGupta (2001), who survey the literature and
recommend more reliable alternatives. Because of the great practical importance
of the problem, we summarize some of their principal recommendations.

For small $n$, the authors recommend two procedures. The first, which goes
back to Wilson (1927), is based on the quadratic inequality

$|\hat{p}_n - p| \le z_{1-\alpha/2} \left[\frac{p(1-p)}{n}\right]^{1/2}$, (11.24)

which has probability under $p$ tending to $1 - \alpha$. So, if we were testing the simple
null hypothesis that $p$ is true, we can invert the test with acceptance region
(11.24). Solving for $p$ in (11.24), one obtains the Wilson interval (Problem 11.39)

$\tilde{p}_n \pm z_{1-\alpha/2} \frac{n^{1/2}}{\tilde{n}} \left[\hat{p}_n \hat{q}_n + \frac{z_{1-\alpha/2}^2}{4n}\right]^{1/2}$, (11.25)

where $\tilde{p}_n = \tilde{S}_n/\tilde{n}$, $\tilde{S}_n = S_n + z_{1-\alpha/2}^2/2$, $\tilde{n} = n + z_{1-\alpha/2}^2$, and $\hat{q}_n = 1 - \hat{p}_n$. As an
alternative, the authors recommend an equal-tailed Bayes interval based on the
Beta prior with $a = b = 1/2$; see Example 5.7.2.

Theoretical and additional numerical support are provided in Brown, Cai and
DasGupta (2002). Other approximations are reviewed in Johnson, Kotz and
Kemp (1992). ■
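Because the binomial distribution has finite support, the coverage probability of an interval such as (11.23) or (11.25) can be computed exactly by summing binomial probabilities over the outcomes whose interval contains $p$. The sketch below does this under the stated formulas; the function names are hypothetical and closed intervals are assumed.

```python
import math
from statistics import NormalDist


def wald_interval(s, n, alpha):
    """Interval (11.23): p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = s / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(s, n, alpha):
    """Interval (11.25), obtained by solving the quadratic inequality (11.24) for p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = s / n
    n_tilde = n + z ** 2
    center = (s + z ** 2 / 2) / n_tilde
    half = z * math.sqrt(n) / n_tilde * math.sqrt(p_hat * (1 - p_hat) + z ** 2 / (4 * n))
    return center - half, center + half


def coverage(interval, n, p, alpha):
    """Exact probability that the interval built from S_n ~ Binomial(n, p) contains p."""
    total = 0.0
    for s in range(n + 1):
        lower, upper = interval(s, n, alpha)
        if lower <= p <= upper:
            total += math.comb(n, s) * p ** s * (1 - p) ** (n - s)
    return total
```

Calling `coverage(wald_interval, n, 0.2, 0.05)` for $n = 30$ and $n = 98$ allows a comparison with the figures quoted above from Brown, Cai and DasGupta (2001), and the same call with `wilson_interval` shows how the alternative behaves at those sample sizes.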


Theorem 11.2.13 (Continuous Mapping Theorem) Suppose $X_n \xrightarrow{d} X$. Let
$g$ be a (measurable) map from $\mathbb{R}^k$ to $\mathbb{R}^s$. Let $C$ be the set of points in $\mathbb{R}^k$ for which
$g$ is continuous. If $P(X \in C) = 1$, then $g(X_n) \xrightarrow{d} g(X)$.

Example 11.2.8 Suppose $X_n$ is a sequence of real-valued random variables such
that $X_n \xrightarrow{d} N(0, \sigma^2)$. By the Continuous Mapping Theorem, it follows that

$X_n^2/\sigma^2 \xrightarrow{d} \chi_1^2$,

where $\chi_k^2$ denotes the Chi-squared distribution with $k$ degrees of freedom. More
generally, suppose $X_n$ is a sequence of $k \times 1$ vector-valued random variables such
that

$X_n \xrightarrow{d} N(0, \Sigma)$,

where $\Sigma$ is assumed positive definite. Then, there exists a unique positive definite
symmetric matrix $C$ such that $C \cdot C = \Sigma$ and we write $C = \Sigma^{1/2}$. (For the
construction of the square root of a positive definite symmetric matrix, see Lehmann
(1999), p. 306.) By the Continuous Mapping Theorem, it follows that

$|C^{-1} X_n|^2 \xrightarrow{d} \chi_k^2$. ■

The following method is often used to prove limit theorems, especially
asymptotic normality.

Theorem 11.2.14 (Delta Method) Suppose $X_1, X_2, \ldots$ and $X$ are random
vectors in $\mathbb{R}^k$. Assume $\tau_n(X_n - \mu) \xrightarrow{d} X$ where $\mu$ is a constant vector and $\{\tau_n\}$
is a sequence of constants $\tau_n \to \infty$.

(i) Suppose $g$ is a function from $\mathbb{R}^k$ to $\mathbb{R}$ which is differentiable at $\mu$ with gradient
(vector of first partial derivatives) of dimension $1 \times k$ at $\mu$ equal to $\dot{g}(\mu)$.$^3$ Then,

$\tau_n[g(X_n) - g(\mu)] \xrightarrow{d} \dot{g}(\mu) X$. (11.26)

In particular, if $X$ is multivariate normal in $\mathbb{R}^k$ with mean vector 0 and
covariance matrix $\Sigma$, then

$\tau_n[g(X_n) - g(\mu)] \xrightarrow{d} N(0, \dot{g}(\mu) \Sigma \dot{g}(\mu)^T)$. (11.27)

(ii) More generally, suppose $g = (g_1, \ldots, g_q)^T$ is a mapping from $\mathbb{R}^k$ to $\mathbb{R}^q$,
where $g_i$ is a function from $\mathbb{R}^k$ to $\mathbb{R}$ which is differentiable at $\mu$. Let $D$ be the
$q \times k$ matrix with $(i, j)$ entry equal to $\partial g_i(y_1, \ldots, y_k)/\partial y_j$ evaluated at $\mu$. Then,

$\tau_n[g(X_n) - g(\mu)] = \tau_n[g_1(X_n) - g_1(\mu), \ldots, g_q(X_n) - g_q(\mu)]^T \xrightarrow{d} DX$.

In particular, if $X$ is multivariate normal in $\mathbb{R}^k$ with mean vector 0 and
covariance matrix $\Sigma$, then

$\tau_n[g(X_n) - g(\mu)] \xrightarrow{d} N(0, D \Sigma D^T)$.

Proof. We prove (i) with (ii) left as an exercise (Problem 11.44). Note that
$X_n - \mu = o_P(1)$. Differentiability of $g$ at $\mu$ implies

$g(x) = g(\mu) + \dot{g}(\mu)(x - \mu) + R(x - \mu)$,

where $R(y) = o(|y|)$ as $|y| \to 0$. Now,

$\tau_n[g(X_n) - g(\mu)] - \dot{g}(\mu)\tau_n(X_n - \mu) = \tau_n R(X_n - \mu)$.

By Slutsky's Theorem, it suffices to show $\tau_n R(X_n - \mu) = o_P(1)$. But,

$\tau_n R(X_n - \mu) = \tau_n |X_n - \mu| \cdot h(X_n - \mu)$,

where $h(y) = R(y)/|y|$ and $h(0)$ is defined to be 0, so that $h$ is continuous at 0.
The weak convergence hypothesis and the Continuous Mapping Theorem imply
$\tau_n |X_n - \mu|$ has a limiting distribution. So, by Slutsky's Theorem, it is enough to
show $h(X_n - \mu) = o_P(1)$. But, this follows by the Continuous Mapping Theorem
as well. ■

3 When $k = 1$, we may also use the notation $g'(\mu)$ for the ordinary first derivative of
$g$ with respect to $\mu$, as well as $g''(\mu)$ for the second derivative.

Note that (11.26) and (11.27) remain true if $\dot{g}(\mu) = 0$ with the interpretation
that the limit distribution places all its mass at zero, in which case we can
conclude

$\tau_n[g(X_n) - g(\mu)] \xrightarrow{P} 0$.

Example 11.2.9 (Binomial Variance) Suppose $S_n$ is binomial based on $n$
trials and success probability $p$. Let $\hat{p}_n = S_n/n$. By the Central Limit Theorem,

$n^{1/2}(\hat{p}_n - p) \xrightarrow{d} N(0, p(1-p))$.

Consider estimating $g(p) = p(1-p)$. By the Delta Method,

$n^{1/2}[g(\hat{p}_n) - g(p)] \xrightarrow{d} N(0, (1-2p)^2 p(1-p))$.

If $p = 1/2$, then $\dot{g}(1/2) = 0$, so that

$n^{1/2}[g(\hat{p}_n) - g(p)] \xrightarrow{P} 0$.

In order to obtain a nondegenerate limit distribution in this case, note that

$n[g(\hat{p}_n) - \tfrac{1}{4}] = -[n^{1/2}(\hat{p}_n - \tfrac{1}{2})]^2$.

Therefore, by the Continuous Mapping Theorem,

$n[g(\hat{p}_n) - \tfrac{1}{4}] \xrightarrow{d} -X^2$,

where $X$ is $N(0, 1/4)$, or

$n[g(\hat{p}_n) - \tfrac{1}{4}] \xrightarrow{d} -\tfrac{1}{4}\chi_1^2$,

where $\chi_1^2$ is a random variable distributed as Chi-squared with one degree of
freedom. ■

In the case $\dot{g}(\mu) = 0$, it is not surprising that the limit distribution is a multiple
of a Chi-squared variable with one degree of freedom. Indeed, suppose $k = 1$ and
$g$ is twice differentiable at $\mu$ with second derivative $g''(\mu)$, so that

$g(x) = g(\mu) + \tfrac{1}{2} g''(\mu)(x - \mu)^2 + R(x - \mu)$,

where $R(x - \mu) = o[(x - \mu)^2]$ as $x \to \mu$. Arguing as in the proof of Theorem
11.2.14 yields

$\tau_n^2[g(X_n) - g(\mu)] - \tau_n^2 \frac{g''(\mu)}{2}(X_n - \mu)^2 = \tau_n^2 R(X_n - \mu) = o_P(1)$ (11.28)

(Problem 11.46). By the Continuous Mapping Theorem,

$\tau_n(X_n - \mu) \xrightarrow{d} X$

implies

$\tau_n^2 \frac{g''(\mu)}{2}(X_n - \mu)^2 \xrightarrow{d} \frac{g''(\mu)}{2} X^2$.

By Slutsky's Theorem, $\tau_n^2[g(X_n) - g(\mu)]$ has this same limiting distribution. Of
course, if $X$ is $N(0, \sigma^2)$, then this limiting distribution is $\frac{g''(\mu)\sigma^2}{2}\chi_1^2$.


Example 11.2.10 (Sample Correlation) Let $(U_i, V_i)$ be i.i.d. bivariate random
vectors in the plane, with both $U_i$ and $V_i$ assumed to have finite nonzero
variances. Let $\sigma_U^2 = Var(U_i)$, $\sigma_V^2 = Var(V_i)$, $\mu_U = E(U_i)$, $\mu_V = E(V_i)$ and
let $\rho = Cov(U_i, V_i)/(\sigma_U \sigma_V)$ be the population correlation coefficient. The usual
sample correlation coefficient is given by

$\hat{\rho}_n = \frac{\sum (U_i - \bar{U}_n)(V_i - \bar{V}_n)/n}{S_U S_V}$, (11.29)

where $\bar{U}_n = \sum U_i/n$, $\bar{V}_n = \sum V_i/n$, $S_U^2 = \sum (U_i - \bar{U}_n)^2/n$ and
$S_V^2 = \sum (V_i - \bar{V}_n)^2/n$. Then, $n^{1/2}(\hat{\rho}_n - \rho)$ is asymptotically normal. The important observation
is that $\hat{\rho}_n$ is a smooth function of the vector of means $\bar{X}_n$, where $X_i$ is the vector
$X_i = (U_i, V_i, U_i^2, V_i^2, U_i V_i)^T$. In fact, $\hat{\rho}_n = g(\bar{X}_n)$, where

$g((y_1, y_2, y_3, y_4, y_5)^T) = \frac{y_5 - y_1 y_2}{(y_3 - y_1^2)^{1/2} (y_4 - y_2^2)^{1/2}}$.

Note that $g$ is smooth and $\dot{g}$ is readily computed. Let $\mu = E(X_i)$ denote the
mean vector. Further assume that $U_i$ and $V_i$ have finite fourth moments. Then,
by the multivariate CLT,

$n^{1/2}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \Sigma)$,

where $\Sigma$ is the covariance matrix of $X_1$. For example, the (1, 5) component of $\Sigma$
is $Cov(U_1, U_1 V_1)$. Hence, by the delta method,

$n^{1/2}[g(\bar{X}_n) - g(\mu)] = n^{1/2}(\hat{\rho}_n - \rho) \xrightarrow{d} N(0, \dot{g}(\mu) \Sigma \dot{g}(\mu)^T)$. (11.30)

As an example, suppose that $(U_i, V_i)$ is bivariate normal; in this case, (11.30)
reduces to (Problem 11.47)

$n^{1/2}(\hat{\rho}_n - \rho) \xrightarrow{d} N(0, (1 - \rho^2)^2)$. (11.31)

This implies $(1 - \hat{\rho}_n^2) \xrightarrow{P} 1 - \rho^2$. Then, by Slutsky's theorem,

$n^{1/2}(\hat{\rho}_n - \rho)/(1 - \hat{\rho}_n^2) \xrightarrow{d} N(0, 1)$,

and so the confidence interval

$\hat{\rho}_n \pm n^{-1/2} z_{1-\alpha/2} (1 - \hat{\rho}_n^2)$

is a pointwise asymptotically level $1 - \alpha$ confidence interval for $\rho$. The error in this
asymptotic approximation derives from both the normal approximation to the
distribution of $\hat{\rho}_n$ and the fact that one is approximating the limiting variance. To
counter the second of these effects, the following variance stabilization technique
can be used. By the delta method, if $h$ is differentiable, then

$n^{1/2}[h(\hat{\rho}_n) - h(\rho)] \xrightarrow{d} N(0, [h'(\rho)]^2 (1 - \rho^2)^2)$.
The idea is to choose $h$ so that the limiting variance does not depend on $\rho$
and is a constant; such a transformation is then called a variance stabilizing
transformation. The solution is known as Fisher's z-transformation and is given
by

$h(\rho) = \frac{1}{2} \log\left(\frac{1 + \rho}{1 - \rho}\right) = \mathrm{arctanh}(\rho)$.

Then,

$h(\hat{\rho}_n) \pm n^{-1/2} z_{1-\alpha/2}$

is a pointwise asymptotically level $1 - \alpha$ confidence interval for $h(\rho)$. The inverse
function of $h$ is the hyperbolic tangent function

$\tanh(y) = h^{-1}(y) = \frac{e^y - e^{-y}}{e^y + e^{-y}}$,

so that

$[\tanh(\mathrm{arctanh}(\hat{\rho}_n) - n^{-1/2} z_{1-\alpha/2}),\ \tanh(\mathrm{arctanh}(\hat{\rho}_n) + n^{-1/2} z_{1-\alpha/2})]$ (11.32)

is also a pointwise asymptotically level $1 - \alpha$ confidence interval for $\rho$.$^4$ ■
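The interval (11.32) is simple to compute, and the variance stabilizing property $h'(\rho)(1 - \rho^2) = 1$ can be checked numerically. The sketch below is illustrative only; the function names and the finite-difference step are assumptions, and the interval uses $n$ rather than any adjusted sample size.

```python
import math
from statistics import NormalDist


def fisher_z_interval(rho_hat, n, alpha):
    """Interval (11.32): tanh(arctanh(rho_hat) -/+ n^(-1/2) * z_{1-alpha/2})."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = math.atanh(rho_hat)
    half = z / math.sqrt(n)
    return math.tanh(center - half), math.tanh(center + half)


def plug_in_interval(rho_hat, n, alpha):
    """Interval rho_hat -/+ n^(-1/2) * z_{1-alpha/2} * (1 - rho_hat^2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * (1 - rho_hat ** 2) / math.sqrt(n)
    return rho_hat - half, rho_hat + half


def stabilized_sd(rho, step=1e-6):
    """Numerical h'(rho) * (1 - rho^2) for h = arctanh; it should be close to 1."""
    slope = (math.atanh(rho + step) - math.atanh(rho - step)) / (2 * step)
    return slope * (1 - rho ** 2)
```

One practical difference is visible for a sample correlation near $\pm 1$ and small $n$: the plug-in interval can extend beyond $[-1, 1]$, whereas the transformed interval always stays inside $(-1, 1)$ because $\tanh$ maps the real line into that range.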


Sometimes, $\{X_n\}$ may not have a limiting distribution, but the weaker property
of tightness may hold, which only requires that no probability escapes to $\pm\infty$.

Definition 11.2.5 A sequence of random vectors $\{X_n\}$ is tight (or uniformly
tight) if, for every $\epsilon > 0$, there exists a constant $B$ such that

$\inf_n P\{|X_n| \le B\} \ge 1 - \epsilon$.

A bounded sequence of numbers $\{x_n\}$ is sometimes written $x_n = O(1)$; more
generally $x_n = O(y_n)$ if $x_n/y_n = O(1)$. If $\{X_n\}$ is tight, we sometimes also say
$X_n$ is bounded in probability, and write $|X_n| = O_P(1)$. If $X_n$ is tight and $Y_n \xrightarrow{P} 0$
(sometimes written $Y_n = o_P(1)$), then $|X_n Y_n| \xrightarrow{P} 0$ (Problem 11.55). The notation
$|X_n| = O_P(|Y_n|)$ means $|X_n|/|Y_n|$ is tight.

Tightness of a sequence of random vectors in $\mathbb{R}^k$ is equivalent to each of the
component variables being tight in $\mathbb{R}$ (Problem 11.40). Note that tightness, like
convergence in distribution, really refers to the sequence of laws of $X_n$, denoted
$\mathcal{L}(X_n)$. Thus, we shall interchangeably refer to tightness of a sequence of random
variables or the sequence of their distributions.

In a statistical context, suppose $X_1, \ldots, X_n$ are i.i.d. according to a model
$\{P_\theta, \theta \in \Omega\}$. Recall that an estimator sequence $T_n$ is a (weakly) consistent
estimator of $g(\theta)$ if, for every $\theta \in \Omega$,

$T_n - g(\theta) \to 0$

in probability when $P_\theta$ is true. An estimator sequence $T_n$ is said to be
$\tau_n$-consistent for $g(\theta)$ if, for every $\theta \in \Omega$,

$\tau_n[T_n - g(\theta)]$

is tight when $P_\theta$ is true. For example, if the underlying population has a finite
variance, it follows from the Central Limit Theorem that the sample mean is a
$n^{1/2}$-consistent estimator of the population mean.

4 For discussion of this transformation, see Mudholkar (1983), Stuart and Ord, Vol. 1
(1987) and Efron and Tibshirani (1993), p. 54. Numerical evidence supports replacing $n$
by $n - 3$ in (11.32).

Whenever $X_n$ converges in distribution to a limit distribution, then $\{X_n\}$ is
tight, and the following partial converse is true. Just as any bounded sequence of
real numbers has a subsequence which converges, so does any sequence of random
variables $X_n$ that is $O_P(1)$. This important result is stated next.


Theorem 11.2.15 (Prohorov's Theorem) Suppose $\{X_n\}$ is tight on $\mathbb{R}^k$.
Then, there exists a subsequence $n_j$ and a random vector $X$ such that
$X_{n_j} \stackrel{d}{\to} X$.


11.2.3 Almost Sure Convergence 

On occasion, we shall utilize a form of convergence of $X_n$ to $X$ stronger than
convergence in probability.


Definition 11.2.6 Suppose $X_n$ and $X$ are random vectors in $\mathbb{R}^k$, defined on
a common probability space $(\mathcal{X}, \mathcal{F}, P)$. Then, $X_n$ is said to converge almost surely
(a.s.) to $X$ if $X_n(\omega) \to X(\omega)$ on a set of points $\omega$ which has probability one; that
is, if

$$P\{\omega \in \mathcal{X} : \lim_{n \to \infty} |X_n(\omega) - X(\omega)| = 0\} = 1 .$$

This is denoted by $X_n \to X$ a.s.

Equivalently, we say that $X_n$ converges to $X$ with probability one, since there
is a set of outcomes $\omega$ having probability one such that $X_n(\omega) \to X(\omega)$. If $X_n$
converges almost surely to $X$, then $X_n$ converges in probability to $X$, but the
converse is false (but see Problem 11.61). Indeed, convergence in probability does
not even guarantee $X_n(\omega) \to X(\omega)$ for any outcome $\omega$. The following provides a
classic counterexample.


Example 11.2.11 (Convergence in probability, but not a.s.) Suppose $U$
is uniformly distributed on $[0,1)$, so that $\mathcal{X}$ is $[0,1)$, $\mathcal{F}$ is the class of Borel sets,
$U = U(\omega) = \omega$, and $P$ is the uniform probability measure. For $m = 1, 2, \ldots$
and $j = 1, \ldots, m$, let $Y_{m,j}$ be one if $U \in [(j-1)/m, j/m)$ and zero otherwise.
For any $m$, exactly one of the $Y_{m,j}$ is one and the rest are zero; also,
$P\{Y_{m,j} = 1\} = 1/m \to 0$ as $m \to \infty$. String together all the variables so that
$X_1 = Y_{1,1}$, $X_2 = Y_{2,1}$, $X_3 = Y_{2,2}$, $X_4 = Y_{3,1}$, $X_5 = Y_{3,2}$, etc. Then,
$X_n \stackrel{P}{\to} 0$. But $X_n$ does not converge to 0 for any outcome $U$ since $X_n$ oscillates
infinitely often between 0 and 1. ■
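The construction in Example 11.2.11 can be traced numerically. The following Python sketch is only an illustration; the helper names `typewriter_index` and `x_n` and the choice of outcome $u = 0.3$ are ours and not part of the text. It enumerates $X_n = Y_{m,j}$ block by block and records where the path of a single outcome equals one.

```python
def typewriter_index(n):
    """Map n = 1, 2, 3, ... to the pair (m, j) with X_n = Y_{m,j}."""
    m = 1
    while n > m:          # block m holds the m variables Y_{m,1}, ..., Y_{m,m}
        n -= m
        m += 1
    return m, n


def x_n(n, u):
    """Value of X_n at the outcome u in [0, 1)."""
    m, j = typewriter_index(n)
    return 1 if (j - 1) / m <= u < j / m else 0


u = 0.3                                        # one fixed outcome (illustrative)
path = [x_n(n, u) for n in range(1, 5051)]     # blocks m = 1, ..., 100
ones = [n for n, x in enumerate(path, start=1) if x == 1]

# Each completed block contributes exactly one index with X_n(u) = 1, so the
# path keeps returning to 1 even though P{X_n = 1} = 1/m tends to 0.
print(len(ones), ones[:5])
```

Because every block contains exactly one hit, the path of any fixed outcome returns to one infinitely often, which is the oscillation described in the example.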
Theorem 11.2.16 (Strong Law of Large Numbers) Let $X_i$ be i.i.d. real-valued
random variables with mean $\mu$. Then

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \to \mu \quad \text{a.s.}$$

Conversely, if $\bar{X}_n \to \mu$ a.s. with $|\mu| < \infty$, then $E|X_i| < \infty$.

In a statistical context, suppose $X_1, \ldots, X_n$ are i.i.d. according to a model
$\{P_\theta,\ \theta \in \Omega\}$. Suppose, under each $\theta$, $T_n = T_n(X_1, \ldots, X_n)$ converges almost
surely to $g(\theta)$. Then, $T_n$ is said to be a strongly consistent estimator of $g(\theta)$.

One of the most fundamental examples of almost sure convergence is provided
by the Glivenko-Cantelli theorem. To state the result, first define the
Kolmogorov-Smirnov distance between c.d.f.s $F$ and $G$ as

$$d_K(F, G) = \sup_t |F(t) - G(t)| . \tag{11.33}$$


Theorem 11.2.17 (Glivenko-Cantelli Theorem) Suppose $X_1, \ldots, X_n$ are
i.i.d. real-valued random variables with c.d.f. $F$. Let $\hat{F}_n$ be the empirical c.d.f.
defined by

$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n I\{X_i \le t\} . \tag{11.34}$$

Then,

$$d_K(\hat{F}_n, F) \to 0 \quad \text{a.s.}$$


To prove the Glivenko-Cantelli Theorem, note that, for every fixed $t$,
$\hat{F}_n(t) \to F(t)$ almost surely, by the Strong Law of Large Numbers. That this convergence
is uniform in $t$ follows from the fact that $F$ is monotone (Problem 11.53).
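As a small numerical illustration of the Glivenko-Cantelli Theorem, the sketch below computes the Kolmogorov-Smirnov distance between the empirical c.d.f. and the true c.d.f. for simulated Uniform $(0,1)$ samples of increasing size. The function name `ks_distance_uniform`, the seed, and the choice of the uniform model are our assumptions for the illustration.

```python
import random


def ks_distance_uniform(sample):
    """sup_t |F_n(t) - t|, the distance (11.33) to the U(0,1) c.d.f. F(t) = t."""
    xs = sorted(sample)
    n = len(xs)
    # F_n jumps from (i-1)/n to i/n at the i-th order statistic, so the
    # supremum is attained at, or just before, one of these jump points.
    return max(max(i / n - x, x - (i - 1) / n)
               for i, x in enumerate(xs, start=1))


random.seed(0)
for n in (100, 1000, 10000, 100000):
    d = ks_distance_uniform([random.random() for _ in range(n)])
    print(n, round(d, 4))      # the distances should shrink as n grows
```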


Example 11.2.12 (Kolmogorov-Smirnov Test) The Glivenko-Cantelli Theorem
11.2.17 forms the basis for the Kolmogorov-Smirnov goodness of fit test,
previously introduced in Section 6.13. Specifically, consider the problem of testing
the simple null hypothesis that $F = F_0$ versus $F \ne F_0$. The Glivenko-Cantelli
Theorem implies that, under $F$,

$$d_K(\hat{F}_n, F_0) \to d_K(F, F_0) \quad \text{a.s.}$$

(and hence in probability as well), where the right side is zero if and only if
$F = F_0$. Thus, the statistic $d_K(\hat{F}_n, F_0)$ tends to be small under the null
hypothesis and large under the alternative. In order for this statistic to have a
nondegenerate limit distribution under $F_0$, we normalize by multiplying by
$n^{1/2}$, and the Kolmogorov-Smirnov goodness of fit test statistic is given by

$$T_n = \sup_{t \in \mathbb{R}} n^{1/2} |\hat{F}_n(t) - F_0(t)| = n^{1/2} d_K(\hat{F}_n, F_0) . \tag{11.35}$$

The Kolmogorov-Smirnov test rejects the null hypothesis if $T_n > s_{n,1-\alpha}$, where
$s_{n,1-\alpha}$ is the $1 - \alpha$ quantile of the null distribution of $T_n$ when $F_0$ is the uniform
$U(0,1)$ distribution. Recall from Section 6.13 that the finite sampling distribution
of $T_n$ under $F_0$ is the same for all continuous $F_0$ (also see Problem 11.57), but its
exact form is difficult to express. Some approaches to obtaining this distribution
are discussed in Durbin (1973) and Section 4.3 of Gibbons and Chakraborti
(1992). Values for $s_{n,1-\alpha}$ have been tabled in Birnbaum (1952). For exact power
calculations in both the continuous and discrete case, see Niederhausen (1981)
and Gleser (1985).

By the duality of tests and confidence regions, the Kolmogorov-Smirnov test
can be inverted to yield uniform confidence bands for $F$, given by

$$R_{n,1-\alpha} = \{F : n^{1/2} \sup_t |\hat{F}_n(t) - F(t)| \le s_{n,1-\alpha}\} . \tag{11.36}$$

By construction, $P_F\{F \in R_{n,1-\alpha}\} = 1 - \alpha$ if $F$ is continuous; furthermore, the
confidence band is conservative if $F$ is not continuous (Problem 11.58).

The limiting behavior of $T_n$ will be discussed in Section 14.2. In fact, when
$F = F_0$, $T_n$ has a continuous strictly increasing limiting distribution with $1 - \alpha$
quantile $s_{1-\alpha}$ (and so $s_{n,1-\alpha} \to s_{1-\alpha}$). It follows that the width of the band
(11.36) is $O(n^{-1/2})$. Alternatives to the Kolmogorov-Smirnov bands that are
narrower in the tails and wider in the middle are discussed in Owen (1995). ■

The following useful inequality, which holds for finite sample sizes, actually 
implies the Glivenko-Cantelli Theorem (Problem 11.59). 

Theorem 11.2.18 (Dvoretzky, Kiefer, Wolfowitz Inequality) Suppose
$X_1, \ldots, X_n$ are i.i.d. real-valued random variables with c.d.f. $F$. Let $\hat{F}_n$ be the
empirical c.d.f. (11.34). Then, for any $d > 0$ and any positive integer $n$,

$$P\{d_K(\hat{F}_n, F) > d\} \le C \exp(-2nd^2) , \tag{11.37}$$

where $C$ is a universal constant.

Massart (1990) shows that we can take $C = 2$, which greatly improves the
original value obtained by Dvoretzky, Kiefer, and Wolfowitz (1956).
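With $C = 2$, inequality (11.37) can be solved for $d$ to give a conservative finite-sample confidence band for $F$: setting $2\exp(-2nd^2) = \alpha$ yields $d = \sqrt{\log(2/\alpha)/(2n)}$. The Python sketch below is only an illustration of this calculation; the function names are ours, and the resulting band is not the exact Kolmogorov-Smirnov band based on $s_{n,1-\alpha}$.

```python
import math


def dkw_half_width(n, alpha, C=2.0):
    """Smallest d with C * exp(-2 n d^2) <= alpha (C = 2 is Massart's constant)."""
    return math.sqrt(math.log(C / alpha) / (2.0 * n))


def dkw_band(sample, alpha):
    """(x, lower, upper) at each order statistic; coverage is at least 1 - alpha."""
    xs = sorted(sample)
    n = len(xs)
    d = dkw_half_width(n, alpha)
    band = []
    for i, x in enumerate(xs, start=1):
        fn = i / n                      # empirical c.d.f. at x (no ties assumed)
        band.append((x, max(fn - d, 0.0), min(fn + d, 1.0)))
    return band


for n in (100, 400, 1600):
    print(n, round(dkw_half_width(n, 0.05), 4))   # halves when n is quadrupled
```

The half-width is of order $n^{-1/2}$, in line with the width of the band (11.36).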

Example 11.2.13 (Monte Carlo Simulation) Suppose $X_1, \ldots, X_n$ are i.i.d.
observations with common distribution $P$. Assume $P$ is known. The problem
is to determine the distribution or quantile of some real-valued statistic
$T_n(X_1, \ldots, X_n)$ for a fixed finite sample size $n$. Denote this distribution by $J_n(t)$,
so that

$$J_n(t) = P\{T_n(X_1, \ldots, X_n) \le t\} .$$

This distribution may not have a tractable form or may not be explicitly
computable, but the following simulation scheme allows the distribution $J_n(t)$ to be
estimated to any desired level of accuracy. For $j = 1, \ldots, B$, let $X_{j,1}, \ldots, X_{j,n}$ be
a sample of size $n$ from $P$; then, one simply evaluates $T_n(X_{j,1}, \ldots, X_{j,n})$, and the
empirical distribution of these $B$ values serves as an approximation to the true
sampling distribution $J_n(t)$. Specifically, $J_n(t)$ is approximated by

$$J_{n,B}(t) = B^{-1} \sum_{j=1}^B I\{T_n(X_{j,1}, \ldots, X_{j,n}) \le t\} .$$

For large $B$, $J_{n,B}(t)$ will be a good approximation to the true sampling
distribution $J_n(t)$. One (though perhaps crude) way of quantifying the closeness of this
approximation is the following. By the Dvoretzky, Kiefer, Wolfowitz inequality
(11.37) (with $B$ now taking over the role of $n$), there exists a universal constant
$C$ so that

$$P\{d_K(J_{n,B}, J_n) > d\} \le C \exp(-2Bd^2) .$$

Hence, if we desire the supremum distance between $J_{n,B}(\cdot)$ and $J_n(\cdot)$ to be
greater than $d$ with probability less than $\epsilon$, all we need to do is
ensure that $B$ is large enough so that $C \exp(-2Bd^2) < \epsilon$. Since $B$, the number
of simulations, is determined by the statistician (assuming enough computing
power), the desired accuracy can be obtained. Further results on the choice of $B$
are given in Jockel (1986).
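The simulation scheme and the choice of $B$ can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the names `required_B`, `simulate`, and `J_nB` are hypothetical, the constant $C = 2$ is Massart's value, and the statistic (the median of $n = 5$ Uniform $(0,1)$ observations) is chosen only because its true distribution satisfies $J_n(0.5) = 0.5$ by symmetry.

```python
import bisect
import math
import random
import statistics


def required_B(d, eps, C=2.0):
    """Smallest integer B with C * exp(-2 B d^2) <= eps."""
    return math.ceil(math.log(C / eps) / (2.0 * d * d))


def simulate(statistic, sampler, n, B, rng):
    """Sorted values T_n(X_{j,1}, ..., X_{j,n}) for j = 1, ..., B."""
    return sorted(statistic([sampler(rng) for _ in range(n)]) for _ in range(B))


def J_nB(sorted_values, t):
    """Empirical c.d.f. of the simulated values, evaluated at t."""
    return bisect.bisect_right(sorted_values, t) / len(sorted_values)


rng = random.Random(2024)
B = required_B(d=0.02, eps=0.01)
values = simulate(statistics.median, lambda r: r.random(), n=5, B=B, rng=rng)
print(B, J_nB(values, 0.5))     # the second number should be close to 0.5
```

Halving $d$ quadruples the required $B$, while tightening $\epsilon$ costs only a logarithmic factor.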

Here, we are tacitly assuming that one can easily accomplish the sampling of
observations from $P$. Of course, when $P$ corresponds to a cumulative distribution
function $F$ on the real line, one can usually just obtain observations from $F$ by
$F^{-1}(U)$, where $U$ is a random variable having the uniform distribution on $(0,1)$.
This construction assumes an ability to calculate an inverse function $F^{-1}(\cdot)$. A
sample $X_{j,1}, \ldots, X_{j,n}$ of $n$ i.i.d. $F$ variables can then be obtained from $n$ i.i.d.
Uniform $(0,1)$ observations $U_{j,1}, \ldots, U_{j,n}$ by the prescription $X_{j,i} = F^{-1}(U_{j,i})$. If
$F$ is not tractable, other methods for generating observations with prescribed
distributions are available in statistical software packages, such as S-plus, Excel,
or Maple.
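The prescription $X = F^{-1}(U)$ can be illustrated with a distribution whose inverse is available in closed form. The Python sketch below uses the exponential c.d.f. $F(x) = 1 - e^{-\lambda x}$, for which $F^{-1}(u) = -\log(1-u)/\lambda$; the choice of distribution and the function names are our assumptions for the illustration.

```python
import math
import random


def exponential_quantile(u, rate=1.0):
    """F^{-1}(u) for F(x) = 1 - exp(-rate * x), x >= 0."""
    return -math.log(1.0 - u) / rate


def sample_via_inverse(quantile, n, rng):
    """n i.i.d. draws X_i = F^{-1}(U_i) from i.i.d. Uniform(0,1) draws U_i."""
    return [quantile(rng.random()) for _ in range(n)]


rng = random.Random(0)
xs = sample_via_inverse(exponential_quantile, 10000, rng)
print(sum(xs) / len(xs))     # should be near the exponential mean 1 / rate = 1
```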

Note, however, that we have ignored any error from the use of a pseudo-random
number generator, which presumably would be needed to generate the
Uniform $(0,1)$ variables. The above idea forms the basis of many approximation
schemes; for some general references on Monte Carlo simulation, see Devroye
(1986) and Ripley (1987). ■

Almost sure convergence is the strongest type of convergence we have
introduced and it has many consequences. For example, suppose $X_n \to X$ almost
surely and $|X_n| \le 1$ with probability one. Then, $|X| \le 1$ with probability one,
and so $E(|X|) \le 1$; by the Lebesgue Dominated Convergence Theorem (Theorem
2.2.2), it follows that $E(X_n) \to E(X)$. If the assumption that $X_n \to X$ almost
surely is replaced by the weaker condition that $X_n$ converges in distribution to
$X$, then the argument to show $E(X_n) \to E(X)$ breaks down. However, we shall
now show that the result continues to hold since the conclusion pertains only to
distributional properties of $X_n$ and $X$. The argument is based on the following
theorem.

Theorem 11.2.19 (Almost Sure Representation Theorem) Suppose
$X_n \stackrel{d}{\to} X$ in $\mathbb{R}^k$. Then, there exist random vectors $\tilde{X}_n$ and $\tilde{X}$ defined on some common
probability space such that $\tilde{X}_n$ has the same distribution as $X_n$ and $\tilde{X}_n \to \tilde{X}$ a.s.
(and so $\tilde{X}$ has the same distribution as $X$).

Example 11.2.14 (Convergence of Moments) Suppose $X_n$ and $X$ are real-valued
random variables and $X_n \stackrel{d}{\to} X$. If the $X_n$ are uniformly bounded, then
$E(X_n) \to E(X)$. To see why, construct $\tilde{X}_n$ and $\tilde{X}$ by the Almost Sure
Representation Theorem and then apply the Dominated Convergence Theorem (Theorem
2.2.2) to the $\tilde{X}_n$ to conclude

$$E(X_n) = E(\tilde{X}_n) \to E(\tilde{X}) = E(X) . \tag{11.38}$$

If the $X_n$ are not uniformly bounded, but $X_n \ge 0$, then by Fatou's Lemma
(Theorem 2.2.1), we may conclude

$$E(X) = E(\tilde{X}) \le \liminf_n E(\tilde{X}_n) = \liminf_n E(X_n) .$$

As a final result, suppose $X_n \stackrel{d}{\to} X$ and $|X|$ has distribution $F$ which is
continuous at $t$. Then, by the Continuous Mapping Theorem,

$$|X_n| I\{|X_n| \le t\} \stackrel{d}{\to} |X| I\{|X| \le t\} .$$

By (11.38), we may conclude

$$E[|X_n| I\{|X_n| \le t\}] \to E[|X| I\{|X| \le t\}] . \tag{11.39}$$

If, in addition, $E|X_n| \to E|X|$, then

$$E[|X_n| I\{|X_n| > t\}] \to E[|X| I\{|X| > t\}] . \tag{11.40}$$

■

11.3 Robustness of Some Classical Tests 

Optimality theory postulates a statistical model and then attempts to determine 
a best procedure for that model. Since model assumptions tend to be unreliable, 
it is necessary to go a step further and ask how sensitive the procedure and its 
optimality are to the assumptions. In the normal models of Chapters 4-7, three 
assumptions are made: independence, identity of distribution, and normality. In 
the two-sample t-test, there is the additional assumption of equality of variance. 
We shall consider the effects of nonnormality and inequality of variance in the 
first subsection, and that of dependence in the next subsection. 

The natural first question to ask about the robustness of a test concerns the 
behavior of the significance level. If an assumption is violated, is the significance 
level still approximately valid? Such questions are typically answered by combining
two methods of attack: The actual significance level under some alternative
distribution is either calculated exactly or, more usually, estimated by simulation. 
In addition, asymptotic results are obtained which provide approximations to the 
true significance level for a wide variety of models. We here restrict ourselves to 
a brief sketch of the latter approach. 


11.3.1 Effect of Distribution 


Consider the one-sample problem where $X_1, \ldots, X_n$ are independently
distributed as $N(\xi, \sigma^2)$. Tests of $H : \xi = \xi_0$ are based on the test statistic

$$t_n = t_n(X_1, \ldots, X_n) = \frac{\sqrt{n}(\bar{X}_n - \xi_0)}{S_n} , \tag{11.41}$$

where $S_n^2 = \sum (X_i - \bar{X}_n)^2/(n-1)$; see Section 5.2. When $\xi = \xi_0$ and the X's
are normal, $t_n$ has the t-distribution with $n - 1$ degrees of freedom. Suppose,
however, that the normality assumption fails and the X's instead are distributed
according to some other distribution $F$ with mean $\xi_0$ and finite variance. Then by
the Central Limit Theorem, $\sqrt{n}(\bar{X}_n - \xi_0)/\sigma$ has the limit distribution $N(0,1)$;
furthermore $S_n/\sigma$ tends to 1 in probability by Example 11.2.6. Therefore, by
Slutsky's theorem, $t_n$ has the limit distribution $N(0,1)$ regardless of $F$. This
shows in particular that the t-distribution with $n - 1$ degrees of freedom tends
to $N(0,1)$ as $n \to \infty$.

To be specific, consider the one-sided t-test which rejects when $t_n > t_{n-1,1-\alpha}$,
where $t_{n-1,1-\alpha}$ is the $1 - \alpha$ quantile of the t-distribution with $n - 1$ degrees of
freedom. It follows from Corollary 11.2.3 and the asymptotic normality of the
t-distribution that (see Problem 11.42 (ii))

$$t_{n-1,1-\alpha} \to z_{1-\alpha} = \Phi^{-1}(1 - \alpha) .$$

In fact, the difference $t_{n-1,1-\alpha} - z_{1-\alpha}$ is $O(n^{-1})$, as will be seen in Section 11.4.1.

Let $\alpha_n(F)$ be the true probability of the rejection region $t_n > t_{n-1,1-\alpha}$ when
the distribution of the X's is $F$. Then $\alpha_n(F) = P_F\{t_n > t_{n-1,1-\alpha}\}$ has the same
limit as $P_\Phi\{t_n > z_{1-\alpha}\}$, which is $\alpha$. Thus, the t-test is pointwise asymptotically
level $\alpha$, assuming the underlying distribution has a finite nonzero variance.
However, the t-test is not uniformly asymptotically level $\alpha$. This issue will be studied
more closely in Section 11.4. For sufficiently large $n$, the actual rejection
probability $\alpha_n(F)$ will be close to the nominal level $\alpha$; how close depends on $F$ and
$n$. For entries to the literature dealing with this dependence, see Cressie (1980),
Tan (1982), Benjamini (1983), and Edelman (1990). Other robust approaches for
testing the mean are discussed in Sutton (1993) and Chen (1995). The use of
resampling will be deferred to Chapter 15.
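How close the rejection probability is to $\alpha$ for a given $F$ and $n$ can be estimated by simulation. The following Python sketch is an illustration only: it uses the normal critical value $z_{1-\alpha}$ in place of $t_{n-1,1-\alpha}$ (the two agree in the limit), takes $F$ to be the exponential distribution with mean $\xi_0 = 1$, and the function names, seed, and number of replications are our choices.

```python
import math
import random
from statistics import NormalDist


def t_statistic(xs, xi0):
    """The statistic (11.41): sqrt(n) * (mean - xi0) / S_n."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * (mean - xi0) / math.sqrt(s2)


def rejection_rate(sampler, xi0, n, alpha, reps, rng):
    """Monte Carlo estimate of P_F{t_n > z_{1-alpha}} when F has mean xi0."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    hits = sum(t_statistic([sampler(rng) for _ in range(n)], xi0) > z
               for _ in range(reps))
    return hits / reps


rng = random.Random(1)
for n in (10, 50, 200):
    rate = rejection_rate(lambda r: r.expovariate(1.0), 1.0, n, 0.05, 2000, rng)
    print(n, rate)       # estimated level of the one-sided test under F
```

For a skewed $F$ such as this one, the estimated level at small $n$ can differ noticeably from the nominal value, which is the dependence on $F$ and $n$ referred to above.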

To study the corresponding test of variance, suppose first that the mean $\xi$ is 0.
When $F$ is normal, the UMP test of $H : \sigma = \sigma_0$ against $\sigma > \sigma_0$ rejects when
$\sum X_i^2/\sigma_0^2$ is too large, where the null distribution of $\sum X_i^2/\sigma_0^2$ is $\chi_n^2$. By the
Central Limit Theorem, $(\sum X_i^2 - n\sigma_0^2)/\sqrt{n}$ tends in law to $N(0, 2\sigma_0^4)$ as $n \to \infty$,
since $Var(X_i^2) = 2\sigma_0^4$. If the rejection region is written as

$$\frac{\sum X_i^2 - n\sigma_0^2}{\sqrt{2n}\,\sigma_0^2} > C_n ,$$

it follows that $C_n \to z_{1-\alpha}$.

Suppose now instead that the X's are distributed according to a distribution
$F$ with $E(X_i) = 0$, $E(X_i^2) = Var(X_i) = \sigma^2$, and $Var(X_i^2) = \gamma^2$. Then
$(\sum X_i^2 - n\sigma_0^2)/\sqrt{n}$ tends in law to $N(0, \gamma^2)$ when $\sigma = \sigma_0$, and the rejection probability
$\alpha_n(F)$ of the test tends to

$$\lim P\left\{ \frac{\sum X_i^2 - n\sigma_0^2}{\sqrt{2n}\,\sigma_0^2} > z_{1-\alpha} \right\} = 1 - \Phi\left( \frac{z_{1-\alpha} \sqrt{2}\,\sigma_0^2}{\gamma} \right) .$$

Depending on $\gamma$, which can take on any positive value, the sequence $\alpha_n(F)$ can
thus tend to any limit $< 1/2$. Even asymptotically and under rather small
departures from normality (if they lead to big changes in $\gamma$), the size of the $\chi^2$-test is
thus completely uncontrolled.
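The limiting rejection probability $1 - \Phi(z_{1-\alpha}\sqrt{2}\,\sigma_0^2/\gamma)$ depends on $F$ only through the ratio $\gamma^2/\sigma_0^4 = E(X_i^4)/\sigma_0^4 - 1$. The Python sketch below evaluates it for a few distributions; the function name is ours, and the kurtosis values used (3 for the normal, 1.8 for the uniform, 6 for the double exponential, 9 for the t-distribution with 5 degrees of freedom) are standard facts rather than values taken from the text.

```python
from statistics import NormalDist


def limiting_level(alpha, gamma2_over_sigma4):
    """1 - Phi(z_{1-alpha} * sqrt(2) * sigma_0^2 / gamma), gamma^2 = Var(X_i^2)."""
    std = NormalDist()
    z = std.inv_cdf(1.0 - alpha)
    return 1.0 - std.cdf(z * (2.0 / gamma2_over_sigma4) ** 0.5)


# gamma^2 / sigma^4 equals (kurtosis - 1) for a mean-zero distribution.
for name, kurtosis in (("normal", 3.0), ("uniform", 1.8),
                       ("double exponential", 6.0), ("t with 5 df", 9.0)):
    print(name, round(limiting_level(0.05, kurtosis - 1.0), 4))
```

Light tails push the limiting level far below the nominal level and heavy tails push it far above, although it always stays below $1/2$ when $\alpha < 1/2$.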

For sufficiently large $n$, the difficulty can be overcome by Studentization,⁵
where one divides the test statistic by a consistent estimate of the asymptotic
standard deviation. Letting $Y_i = X_i^2$ and $E(Y_i) = \eta = \sigma^2$, the test statistic
then reduces to $\sqrt{n}(\bar{Y} - \eta_0)$. To obtain an asymptotically valid test, it is only
necessary to divide by a suitable estimator of $\sqrt{Var(Y_i)}$ such as $\sqrt{\sum (Y_i - \bar{Y})^2/n}$.
(However, since $Y_i^2 = X_i^4$, small changes in the tail of $X_i$ may have large effects
on $Y_i^2$, and $n$ may have to be rather large for the asymptotic result to give a
good approximation.)

5 Studentization is defined in a more general context at the end of Section 7.3.
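The contrast between the normal-theory test and its Studentized version can be examined by simulation. The Python sketch below is a rough illustration under our own assumptions: the data are double exponential (generated as the difference of two independent standard exponentials, so that $\sigma_0^2 = 2$), the mean is known to be zero, both tests use the normal critical value $z_{1-\alpha}$, and the function names, seed, sample size, and number of replications are arbitrary choices.

```python
import math
import random
from statistics import NormalDist


def variance_test_decisions(xs, sigma0_sq, z):
    """(normal-theory, Studentized) rejections of H: sigma^2 = sigma0^2 vs >."""
    n = len(xs)
    ys = [x * x for x in xs]                       # Y_i = X_i^2
    ybar = sum(ys) / n
    numerator = math.sqrt(n) * (ybar - sigma0_sq)
    # Normal theory: standard deviation of Y_i taken to be sqrt(2) * sigma0^2.
    normal_theory = numerator / (math.sqrt(2.0) * sigma0_sq) > z
    # Studentized: standard deviation of Y_i estimated from the data.
    sd_y = math.sqrt(sum((y - ybar) ** 2 for y in ys) / n)
    studentized = numerator / sd_y > z
    return normal_theory, studentized


def laplace(rng):
    """Double exponential variate with mean 0 and variance 2."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


rng = random.Random(7)
z = NormalDist().inv_cdf(0.95)
n, reps = 200, 1000
counts = [0, 0]
for _ in range(reps):
    a, b = variance_test_decisions([laplace(rng) for _ in range(n)], 2.0, z)
    counts[0] += a
    counts[1] += b
normal_rate, student_rate = counts[0] / reps, counts[1] / reps
print(normal_rate, student_rate)     # estimated levels at nominal alpha = 0.05
```

In runs of this kind the normal-theory test rejects far too often, while the Studentized test may still be some distance from the nominal level at moderate $n$, in keeping with the caution about the tails of $X_i$.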

When $\xi$ is unknown, the normal theory test for $\sigma^2$ is based on $\sum (X_i - \bar{X}_n)^2$,
and the sequence

$$\frac{1}{\sqrt{n}} \left[ \sum (X_i - \bar{X}_n)^2 - n\sigma_0^2 \right] = \frac{1}{\sqrt{n}} \left( \sum X_i^2 - n\sigma_0^2 \right) - \frac{1}{\sqrt{n}}\, n\bar{X}_n^2$$

again has the limit distribution $N(0, \gamma^2)$. To see this, note that the distribution
of $\sum (X_i - \bar{X}_n)^2$ is independent of $\xi$ and put $\xi = 0$. Since $\sqrt{n}\bar{X}_n$ has a (normal)
limit distribution, $n\bar{X}_n^2$ is bounded in probability and so $n\bar{X}_n^2/\sqrt{n}$ tends to zero
in probability. The result now follows from that for $\xi = 0$ and Slutsky's theorem.

The above results carry over to the corresponding two-sample problems that
were considered in Section 5.3. Consider the two-sample t-statistic given by
(5.28). An extension of the one-sample argument shows that as $m, n \to \infty$,
$(\bar{Y}_n - \bar{X}_m)/(\sigma\sqrt{1/m + 1/n})$ tends in law to $N(0,1)$ while
$[\sum (X_i - \bar{X}_m)^2 + \sum (Y_j - \bar{Y}_n)^2]/[(m+n-2)\sigma^2]$ tends in probability to 1 for samples
$X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ from any common distribution $F$ with finite variance. Thus, the
rejection probability $\alpha_{m,n}(F)$ tends to $\alpha$ for any such $F$. As will be seen in Section 11.3.3, the
same robustness property for the UMP invariant test of equality of $s$ means also
holds.

On the other hand, the F-test for variances, just like the one-sample $\chi^2$-test,
is extremely sensitive to the assumption of normality. To see this, express the
rejection region in terms of $\log S_Y^2 - \log S_X^2$, where $S_X^2 = \sum (X_i - \bar{X}_m)^2/(m-1)$
and $S_Y^2 = \sum (Y_j - \bar{Y}_n)^2/(n-1)$, and suppose that as $m$ and $n \to \infty$, $m/(m+n)$
remains fixed at $\rho$. By the result for the one-sample problem and the delta method
with $g(u) = \log u$ (Theorem 11.2.14), it is seen that $\sqrt{m}[\log S_X^2 - \log \sigma^2]$ and
$\sqrt{n}[\log S_Y^2 - \log \sigma^2]$ both tend in law to $N(0, \gamma^2/\sigma^4)$ when the X's and Y's are
distributed as $F$, and hence that $\sqrt{m+n}[\log S_Y^2 - \log S_X^2]$ tends in law to the
normal distribution with mean 0 and variance

$$\frac{\gamma^2}{\sigma^4} \left( \frac{1}{\rho} + \frac{1}{1-\rho} \right) = \frac{\gamma^2}{\rho(1-\rho)\sigma^4} .$$

In the particular case that $F$ is normal, $\gamma^2 = 2\sigma^4$ and the variance of the limit
distribution is $2/[\rho(1-\rho)]$. For other distributions $\gamma^2/\sigma^4$ can take on any positive
value and, as in the one-sample case, $\alpha_n(F)$ can tend to any limit less than $1/2$. [For
an entry into the extensive literature on more robust alternatives, see for example
Conover, Johnson, and Johnson (1981), Tiku and Balakrishnan (1984), Boos and
Brownie (1989), Baker (1995), Hall and Padmanabhan (1997), and Section 2.10
of Hettmansperger and McKean (1998).]

Having found that the rejection probability of the one- and two-sample t-tests
is relatively insensitive to nonnormality (at least for large samples), let us turn
to the corresponding question concerning the power of these tests. By similar
asymptotic calculations, it can be shown that the same conclusion holds: Power
values of the t-tests obtained under normality are asymptotically valid also for
all other distributions with finite variance. This is a useful result if it has been
decided to employ a t-test and one wishes to know what power it will have against
a given alternative $\xi/\sigma$ or $(\eta - \xi)/\sigma$, or what sample sizes are required to obtain
a given power.

Recall that there exists a modification of the t-test, the permutation version of
the t-test discussed in Section 5.9, whose size is independent of $F$ not only
asymptotically but exactly. Moreover, we will see in Section 15.2 that its asymptotic
power is equal to that of the t-test. It may seem that the permutation t-test has
all the properties one could hope for. However, this overlooks the basic question
of whether the t-test itself, which is optimal under normality, will retain a high
standing with respect to its competitors under other distributions. The t-tests
are in fact not robust in this sense. Some tests which are preferable when a broad
spectrum of distributions $F$ is considered possible were discussed in Section 6.9.
A permutation test with this property has been proposed by Lambert (1985).

As a last problem, consider the level of the two-sample t-test when the variances
$Var(X_i) = \sigma^2$ and $Var(Y_j) = \tau^2$ may differ (as in the Behrens-Fisher problem),
and the assumption of normality may fail as well. As before, one finds that
$(\bar{Y}_n - \bar{X}_m)/\sqrt{\sigma^2/m + \tau^2/n}$ tends in law to $N(0,1)$ as $m, n \to \infty$, while
$S_X^2 = \sum (X_i - \bar{X}_m)^2/(m-1)$ and $S_Y^2 = \sum (Y_j - \bar{Y}_n)^2/(n-1)$ respectively tend to
$\sigma^2$ and $\tau^2$ in probability. If $m$ and $n$ tend to $\infty$ through a sequence with fixed
proportion $m/(m+n) = \rho$, the squared denominator of the t-statistic,

$$D^2 = \frac{m-1}{m+n-2} S_X^2 + \frac{n-1}{m+n-2} S_Y^2 ,$$

tends in probability to $\rho\sigma^2 + (1-\rho)\tau^2$, and the limit of

$$t = \frac{\bar{Y}_n - \bar{X}_m}{D \sqrt{1/m + 1/n}}$$

is normal with mean zero and variance

$$\frac{(1-\rho)\sigma^2 + \rho\tau^2}{\rho\sigma^2 + (1-\rho)\tau^2} . \tag{11.42}$$

When $m = n$, so that $\rho = 1/2$, the t-test thus has approximately the right level
even if $\sigma$ and $\tau$ are far apart. The accuracy of this approximation for different
values of $m = n$ and $\tau/\sigma$ is discussed by Ramsey (1980) and Posten, Yeh, and
Owen (1982). However, when $\rho \ne 1/2$, the actual size of the test can differ greatly
from the nominal level $\alpha$ even for large $m$ and $n$. An approximate test of the
hypothesis $H : \eta = \xi$ when $\sigma$, $\tau$ are not assumed equal, which asymptotically is
free of this difficulty, can be obtained through Studentization, i.e., by replacing
$D^2 (1/m + 1/n)$ with $(1/m)S_X^2 + (1/n)S_Y^2$ in the t-statistic and referring the
resulting statistic to the standard normal distribution. This approximation is very
crude, and not reliable unless $m$ and $n$ are fairly large. A refinement, the Welch
approximate t-test, refers the resulting statistic not to the standard normal but to
the t-distribution with a random number of degrees of freedom $f$ given by

$$\frac{1}{f} = \left( \frac{R}{1+R} \right)^2 \frac{1}{m-1} + \frac{1}{(1+R)^2} \cdot \frac{1}{n-1} ,$$

where $R = (nS_X^2)/(mS_Y^2)$.⁶ When the X's and Y's are normal, the actual level
of this test has been shown to be quite close to the nominal level for sample sizes
as small as $m = 4$, $n = 8$ and $m = n = 6$ [see Wang (1971)]. A further refinement
will be mentioned in Section 15.6. A simple but crude approach that controls the
level is to use as degrees of freedom the smaller of $n - 1$ and $m - 1$, as remarked
by Scheffé (1970).
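The quantities entering the Welch approximate t-test are easy to compute directly. The Python sketch below is an illustration; the function names `welch` and `limit_variance` and the small data set are ours. It returns the Studentized statistic $(\bar{Y}_n - \bar{X}_m)/\sqrt{S_X^2/m + S_Y^2/n}$ together with the degrees of freedom $f$, and also evaluates the limiting variance (11.42) of the ordinary two-sample t-statistic.

```python
import math


def welch(xs, ys):
    """Studentized two-sample statistic and Welch's degrees of freedom f."""
    m, n = len(xs), len(ys)
    xbar, ybar = sum(xs) / m, sum(ys) / n
    sx2 = sum((x - xbar) ** 2 for x in xs) / (m - 1)
    sy2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    stat = (ybar - xbar) / math.sqrt(sx2 / m + sy2 / n)
    R = (n * sx2) / (m * sy2)
    inv_f = (R / (1 + R)) ** 2 / (m - 1) + 1 / ((1 + R) ** 2 * (n - 1))
    return stat, 1.0 / inv_f


def limit_variance(rho, sigma2, tau2):
    """Limiting variance (11.42) of the pooled t-statistic, rho = m/(m+n)."""
    return ((1 - rho) * sigma2 + rho * tau2) / (rho * sigma2 + (1 - rho) * tau2)


stat, f = welch([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(stat, f)                        # f lies between min(m, n) - 1 and m + n - 2
print(limit_variance(0.5, 1.0, 9.0))  # equal sample sizes: limiting variance 1
print(limit_variance(0.8, 1.0, 9.0))  # unbalanced: pooled test is far too liberal
```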

The robustness of the level of Welch's test against nonnormality is studied by
Yuen (1974), who shows that for heavy-tailed distributions the actual level tends
to be considerably smaller than the nominal level (which leads to an undesirable
loss of power), and who proposes an alternative. Some additional results are
discussed in Scheffé (1970) and in Tiku and Singh (1981). The robustness of
some quite different competitors of the t-test is investigated in Pratt (1964).

For testing the equality of $s$ normal means with $s > 2$, the classical test based
on the F-statistic (7.19) is not robust, even if all the observations are normally
distributed, regardless of the sample sizes (Scheffé (1959), Problem 11.86); again,
the problem is due to the assumption of a common variance. More appropriate
tests for this generalized Behrens-Fisher problem have been proposed by Welch
(1951), James (1951), and Brown and Forsythe (1974a), and are further
discussed by Clinch and Kesselman (1982), Hettmansperger and McKean (1998)
and Chapter 10 of Pesarin (2001). The corresponding robustness problem for
more general linear hypotheses is treated by James (1954) and Johansen (1980);
see also Rothenberg (1984).


11.3.2 Effect of Dependence 

The one-sample t-test arises when a sequence of measurements $X_1, \ldots, X_n$ is
taken of a quantity $\xi$, and the X's are assumed to be independently distributed
as $N(\xi, \sigma^2)$. The effect of nonnormality on the level of the test was discussed in the
preceding subsection. Independence may seem like a more innocuous assumption.
However, it has been found that observations occurring close in time or space are
often positively correlated [Student (1927), Hotelling (1961), Cochran (1968)].
The present section will therefore be concerned with the effect of this type of
dependence.


Lemma 11.3.1 Let $X_1, \ldots, X_n$ be jointly normally distributed with common
marginal distribution $N(0, \sigma^2)$ and with correlation coefficients $\rho_{i,j} = corr(X_i, X_j)$.
Assume that

$$\frac{1}{n} \sum\sum_{i \ne j} \rho_{i,j} \to \gamma \tag{11.43}$$

and

$$\frac{1}{n^2} \sum\sum_{i \ne j} \rho_{i,j}^2 \to 0 \tag{11.44}$$

as $n \to \infty$. Then,

(i) the distribution of the t-statistic $t_n$ defined in equation (11.41) (with $\xi_0 = 0$)
tends to the normal distribution $N(0, 1 + \gamma)$;

(ii) if $\gamma \ne 0$, the level of the t-test is not robust even asymptotically as $n \to \infty$.
Specifically, if $\gamma > 0$, the asymptotic level of the t-test carried out at nominal
level $\alpha$ is

$$1 - \Phi\left( \frac{z_{1-\alpha}}{\sqrt{1+\gamma}} \right) > 1 - \Phi(z_{1-\alpha}) = \alpha .$$

6 For a variant see Fenstad (1983).

Proof. (i): Since the $X_i$ are jointly normal, the numerator $\sqrt{n}\bar{X}_n$ of $t_n$ is also
normal, with mean zero and variance

$$Var(\sqrt{n}\bar{X}_n) = \sigma^2 \left[ 1 + \frac{1}{n} \sum\sum_{i \ne j} \rho_{i,j} \right] \to \sigma^2(1 + \gamma) , \tag{11.45}$$

and hence tends in law to $N(0, \sigma^2(1+\gamma))$. The denominator of $t_n$ is the square
root of

$$S_n^2 = \frac{1}{n-1} \sum X_i^2 - \frac{n}{n-1} \bar{X}_n^2 .$$

By (11.45), $Var(\bar{X}_n) \to 0$ and so $\bar{X}_n \stackrel{P}{\to} 0$. A calculation similar to (11.45)
shows that $Var(n^{-1} \sum_{i=1}^n X_i^2) \to 0$ (Problem 11.65). Thus, $n^{-1} \sum_{i=1}^n X_i^2 \stackrel{P}{\to} \sigma^2$
and so $S_n \stackrel{P}{\to} \sigma$. By Slutsky's theorem, the distribution of $t_n$ therefore tends to
$N(0, 1+\gamma)$.

The implications (ii) are obvious. ■


Under the assumptions of Lemma 11.3.1, the joint distribution of the X's is
determined by $\sigma^2$ and the correlation coefficients $\rho_{i,j}$, with the asymptotic level
of the t-test depending only on $\gamma$. The following examples illustrating different
correlation structures show that even under rather weak dependence of the
observations, the assumptions of Lemma 11.3.1 are satisfied with $\gamma \ne 0$, and hence
that the level of the t-test is quite sensitive to the assumption of independence.

Model A. (Cluster Sampling). Suppose the observations occur in $s$
groups (or clusters) of size $m$, and that any two observations within a group
have a common correlation coefficient $\rho$, while those in different groups are
independent. (This may be the case, for instance, when the observations within a
group are those taken on the same day or by the same observer, or involve some
other common factor.) Then (Problem 11.67),

$$Var(\bar{X}) = \frac{\sigma^2}{ms} [1 + (m-1)\rho] ,$$

which tends to zero as $s \to \infty$. The conditions of the lemma hold with
$\gamma = (m-1)\rho$, and the level of the t-test is not asymptotically robust as $s \to \infty$. In
particular, the test overstates the significance of the results when $\rho > 0$.

To provide a specific structure leading to this model, denote the observations
in the $i$th group by $X_{i,j}$ ($j = 1, \ldots, m$), and suppose that $X_{i,j} = A_i + U_{i,j}$,
where $A_i$ is a factor common to the observations in the $i$th group. If the A's and
U's (none of which are observable) are all independent with normal distributions
$N(\xi, \sigma_A^2)$ and $N(0, \sigma_0^2)$ respectively, then the joint distribution of the X's is that
prescribed by Model A with $\sigma^2 = \sigma_A^2 + \sigma_0^2$ and $\rho = \sigma_A^2/\sigma^2$.

Model B. (Moving-Average Process). When the dependence of nearby
observations is not due to grouping as in Model A, it is often reasonable to assume
that $\rho_{i,j}$ depends only on $|j - i|$ and is nonincreasing in $|j - i|$. Let $\rho_{i,i+k}$ then
be denoted by $\rho_k$, and suppose that the correlation between $X_i$ and $X_{i+k}$ is
negligible for $k > m$ ($m$ an integer $< n$), so that one can put $\rho_k = 0$ for $k > m$.
Then the conditions for Lemma 11.3.1 are satisfied with

$$\gamma = 2 \sum_{k=1}^m \rho_k .$$

In particular, if $\rho_1, \ldots, \rho_m$ are all positive, the t-test is again too liberal.

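A specific structure leading to Model B is given by the moving-average process

$$X_i = \xi + \sum_{j=0}^m \beta_j U_{i+j} ,$$

where the U's are independent $N(0, \sigma_0^2)$. The variance $\sigma^2$ of the X's is then
$\sigma^2 = \sigma_0^2 \sum_{j=0}^m \beta_j^2$ and

$$\rho_k = \begin{cases} \dfrac{\sum_{i=0}^{m-k} \beta_i \beta_{i+k}}{\sum_{j=0}^m \beta_j^2} & \text{for } k \le m , \\[1ex] 0 & \text{for } k > m . \end{cases}$$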

Model C. (First-Order Autoregressive Process). A simple model
for dependence in which the $|\rho_k|$ are decreasing in $k$ but $\ne 0$ for all $k$ is the
first-order autoregressive process defined by

$$X_{i+1} = \xi + \beta(X_i - \xi) + U_{i+1}, \quad |\beta| < 1, \quad i = 1, 2, \ldots ,$$

with the $U_i$ independent $N(0, \sigma_0^2)$. If $X_1$ is $N(\xi, \tau^2)$, the marginal distribution of
$X_i$ for $i > 1$ is normal with mean $\xi$ and variance $\sigma_i^2 = \beta^2 \sigma_{i-1}^2 + \sigma_0^2$. The variance
of $X_i$ will thus be independent of $i$ provided $\tau^2 = \sigma_0^2/(1 - \beta^2)$. For the sake of
simplicity we shall assume this to be the case, and take $\xi$ to be zero. From

$$X_{i+k} = \beta^k X_i + \beta^{k-1} U_{i+1} + \beta^{k-2} U_{i+2} + \cdots + \beta U_{i+k-1} + U_{i+k}$$

it then follows that $\rho_k = \beta^k$, so that the correlation between $X_i$ and $X_j$ decreases
exponentially with increasing $|j - i|$. The assumptions of Lemma 11.3.1 are again
satisfied, and $\gamma = 2\beta/(1 - \beta)$. Thus, in this case too, the level of the t-test is
not asymptotically robust. [Some values of the actual asymptotic level when the
nominal level is .05 or .01 are given by Gastwirth and Rubin (1971).]
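To get a feeling for the size of the effect, note that if $t_n$ is asymptotically $N(0, 1+\gamma)$ as in Lemma 11.3.1(i), then a test that rejects when $t_n$ exceeds $z_{1-\alpha}$ has asymptotic level $1 - \Phi(z_{1-\alpha}/\sqrt{1+\gamma})$. The Python sketch below evaluates this expression using the values of $\gamma$ stated for Models A, B, and C. The function names and the particular parameter values are ours, chosen only for illustration; the formula requires $1 + \gamma > 0$.

```python
from statistics import NormalDist


def asymptotic_level(alpha, gamma):
    """1 - Phi(z_{1-alpha} / sqrt(1 + gamma)); requires 1 + gamma > 0."""
    std = NormalDist()
    return 1.0 - std.cdf(std.inv_cdf(1.0 - alpha) / (1.0 + gamma) ** 0.5)


def gamma_cluster(m, rho):
    """Model A: gamma = (m - 1) * rho."""
    return (m - 1) * rho


def gamma_moving_average(betas):
    """Model B: gamma = 2 * sum_k rho_k for X_i = xi + sum_j beta_j U_{i+j}."""
    m = len(betas) - 1
    denom = sum(b * b for b in betas)
    rhos = [sum(betas[i] * betas[i + k] for i in range(m - k + 1)) / denom
            for k in range(1, m + 1)]
    return 2.0 * sum(rhos)


def gamma_ar1(beta):
    """Model C: gamma = 2 * beta / (1 - beta), |beta| < 1."""
    return 2.0 * beta / (1.0 - beta)


for label, g in (("independent", 0.0),
                 ("cluster, m=5, rho=0.1", gamma_cluster(5, 0.1)),
                 ("moving average, betas=(1, 1)", gamma_moving_average([1.0, 1.0])),
                 ("AR(1), beta=0.5", gamma_ar1(0.5))):
    print(label, round(g, 3), round(asymptotic_level(0.05, g), 4))
```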

It is seen that in general the effect of dependence on the level of the t-test is
more serious than that of nonnormality. In order to robustify the test against
general dependence through Studentization (as was done in the two-sample case with
unequal variances), it is necessary to consistently estimate $\gamma$, which implicitly
depends on estimation of all the $\rho_{i,j}$. Unfortunately, the number of parameters $\rho_{i,j}$
exceeds the number of observations. However, robustification is possible against
some types of dependence. For example, it may be reasonable to assume a model
such as A-C so that it is only required to estimate a reduced number of
correlations.⁷ Some specific procedures of this type are discussed by Albers (1978)
[and for an associated sign test by Falk and Kohne (1984)]. Such robust
procedures will in fact often also be insensitive to the assumption of normality, as can
be shown by appealing to an appropriate Central Limit Theorem for dependent
variables [see e.g. Billingsley (1995, Section 27)]. The validity of these procedures
is of course limited to the particular model assumed, including the value of a
parameter such as $m$ in Models A and B. In fact, robustification is achievable for
fairly general classes of models with dependence by using an appropriate
bootstrap method; see Problem 15.33 and Lahiri (2003). Alternatively, one can use
subsampling, as in Romano and Thombs (1996); see Section 15.7.

The results of the present section easily extend to the case of the two-sample
t-test, when each of the two series of observations shows dependence of the kind
considered here.


11.3.3 Robustness in Linear Models 

In this section, we consider the large sample robustness properties of some of the 
linear model tests discussed in Chapter 7. As in Section 11.3.1, we focus on the 
effect of distribution. 

A large class of these testing situations is covered by the following general
model, which was discussed in Problem 7.8. Let $X_1, \ldots, X_n$ be independent with
$E(X_i) = \xi_i$ and $Var(X_i) = \sigma^2 < \infty$, where we assume the vector $\xi$ to lie in
an $s$-dimensional subspace $\Pi_\Omega$ of $\mathbb{R}^n$, defined by the following parametric set of
equations

$$\xi_i = \sum_{j=1}^s a_{i,j} \beta_j , \quad i = 1, \ldots, n . \tag{11.46}$$

Here the $a_{i,j}$ are known coefficients and the $\beta_j$ are unknown parameters. In matrix
form, the $n \times 1$ vector $\xi$ with $i$th component $\xi_i$ satisfies $\xi = A\beta$, where $A$ is an
$n \times s$ matrix having $(i,j)$ entry $a_{i,j}$ and $\beta$ is an $s \times 1$ vector with $j$th component
$\beta_j$. It is assumed $A$ is known and of rank $s$. In the asymptotics below, the $a_{i,j}$
may depend on $n$, but $s$ remains fixed. Throughout, the notation will suppress
this dependence on $n$.

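The least squares estimators $\hat\xi_1,\ldots,\hat\xi_n$ of $\xi_1,\ldots,\xi_n$ are defined as the values 
of $\xi_i$ minimizing 

$$\sum_{i=1}^{n}(X_i-\xi_i)^2$$

subject to $\xi\in\Pi_\Omega$, where $\Pi_\Omega$ is the space spanned by the $s$ columns of $A$. 
Correspondingly, the least squares estimators $\hat\beta_1,\ldots,\hat\beta_s$ of $\beta_1,\ldots,\beta_s$ are the values 
of $\beta_j$ minimizing 

$$\sum_{i=1}^{n}\Bigl(X_i-\sum_{j=1}^{s}a_{i,j}\beta_j\Bigr)^2.$$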


7 Models of a sequence of dependent observations with various covariance structures 
are discussed in books on time series such as Brockwell and Davis (1991), Hamilton 
(1994) or Fuller (1996). 
By taking partial derivatives of this last expression with respect to the $\beta_j$, it 
is seen that the $\hat\beta_j$ are solutions of the equations 

$$A^TA\beta=A^TX$$

and so 

$$\hat\beta=(A^TA)^{-1}A^TX.$$

(The fact that $A^TA$ is nonsingular follows from Problem 6.3.) Thus, 

$$\hat\xi=PX,$$

where 

$$P=A(A^TA)^{-1}A^T. \tag{11.47}$$

In fact, $\hat\xi$ is the projection of $X$ into the space $\Pi_\Omega$. (These estimators formed the 
basis of optimal invariant tests studied in Chapter 7.) Some basic properties of 
$P$ and $\hat\xi$ are recorded in the following lemma. 

Lemma 11.3.2 (i) The matrix $P$ defined by (11.47) is symmetric ($P=P^T$) 
and idempotent ($P^2=P$). 

(ii) $X-\hat\xi$ is orthogonal to $\hat\xi$; that is, 

$$\hat\xi^T(X-\hat\xi)=0.$$

Proof. The proof of (i) follows by matrix algebra (Problem 11.71). To prove (ii), 
note that 

$$\hat\xi^T(X-\hat\xi)=(PX)^T(X-PX)=X^TP^T(X-PX)=X^TP^TX-X^TP^TPX=0,$$

since by (i) $P^TP=P^T$. ■ 
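
The properties in Lemma 11.3.2 can be checked numerically on a small case. The following is a minimal sketch, assuming an illustrative simple-regression design (columns 1 and t_i) and made-up data X; the helper names `matmul`, `transpose` and `inverse_2x2` are hypothetical and not part of the text.

```python
# Sketch: check Lemma 11.3.2 numerically for a small regression design.
# The design points t and the data X are illustrative assumptions.

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

t = [0.0, 1.0, 2.0, 4.0, 7.0]
A = [[1.0, ti] for ti in t]                                  # n x s design, s = 2
At = transpose(A)
P = matmul(matmul(A, inverse_2x2(matmul(At, A))), At)        # P = A (A^T A)^{-1} A^T

X = [1.3, 0.2, 2.9, 3.1, 8.4]
xi_hat = [sum(p * x for p, x in zip(row, X)) for row in P]   # xi_hat = P X
residual = [x - f for x, f in zip(X, xi_hat)]

n = len(t)
sym_err = max(abs(P[i][j] - P[j][i]) for i in range(n) for j in range(n))
P2 = matmul(P, P)
idem_err = max(abs(P2[i][j] - P[i][j]) for i in range(n) for j in range(n))
inner = sum(f * r for f, r in zip(xi_hat, residual))         # xi_hat^T (X - xi_hat)
trace = sum(P[i][i] for i in range(n))                       # equals the rank s of A
print(sym_err, idem_err, inner, trace)
```

Up to rounding error, the symmetry and idempotency errors and the inner product are zero, and the trace equals the number of columns of A.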

Note that $\hat\beta_j$ is a linear combination of the $X_i$. Thus, if the $X_i$ are normally 
distributed, so are the $\hat\beta_j$. Without the assumption of normality, the asymptotic 
normality of $\hat\beta_j$ can be established by the following lemma, which can be obtained 
as a consequence of the Lindeberg Central Limit Theorem (Problem 11.72). 

Lemma 11.3.3 Let $Y_1,Y_2,\ldots$ be independently identically distributed with mean 
zero and finite variance $\sigma^2$. (i) Let $c_1,c_2,\ldots$ be a sequence of constants. Then a 
sufficient condition for $\sum_{i=1}^{n}c_iY_i/\sqrt{\sum_{j=1}^{n}c_j^2}$ to tend in law to $N(0,\sigma^2)$ is that 

$$\frac{\max_{i=1,\ldots,n}c_i^2}{\sum_{j=1}^{n}c_j^2}\to 0\quad\text{as } n\to\infty. \tag{11.48}$$

(ii) More generally, suppose $C_{n,1},\ldots,C_{n,n}$ is a sequence of random variables, 
independent of $Y_1,\ldots,Y_n$. Then, a sufficient condition for $\sum_{i=1}^{n}C_{n,i}Y_i/\sqrt{\sum_{j=1}^{n}C_{n,j}^2}$ 
to tend in law to $N(0,\sigma^2)$ is 

$$\frac{\max_{i=1,\ldots,n}C_{n,i}^2}{\sum_{j=1}^{n}C_{n,j}^2}\stackrel{P}{\to}0\quad\text{as } n\to\infty.$$

The condition (11.48) prevents the $c$'s from increasing so fast that the last 
term essentially dominates the sum, in which case there is no reason to expect 
asymptotic normality. 


Example 11.3.1 Suppose $U_1,U_2,\ldots$ are i.i.d. with mean 0 and finite nonzero 
variance $\sigma^2$. Consider the simple regression model 

$$X_i=\alpha+\beta t_i+U_i,$$

where the $t_i$ are known and not all equal. The least squares estimator $\hat\beta$ of $\beta$ 
satisfies 

$$\hat\beta-\beta=\frac{\sum_i(X_i-\alpha-\beta t_i)(t_i-\bar t)}{\sum_i(t_i-\bar t)^2}.$$

By Lemma 11.3.3, 

$$\frac{(\hat\beta-\beta)\sqrt{\sum_i(t_i-\bar t)^2}}{\sigma}\stackrel{d}{\to}N(0,1)$$

provided 

$$\frac{\max_i(t_i-\bar t)^2}{\sum_j(t_j-\bar t)^2}\to 0. \tag{11.49}$$

Condition (11.49) holds in the case of equal spacing $t_i=a+i\Delta$, but not when 
the $t$'s grow exponentially, for example, when $t_i=2^i$ (Problem 11.73). ■ 
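
To make the contrast in Example 11.3.1 concrete, the ratio in (11.49), namely max_i (t_i - tbar)^2 / sum_j (t_j - tbar)^2, can be evaluated for the two designs. This is only an illustrative sketch; the function name `huber_ratio` and the particular values a = 1, Delta = 0.5 are assumptions made for the example.

```python
def huber_ratio(t):
    """max_i (t_i - tbar)^2 / sum_j (t_j - tbar)^2 for design points t."""
    tbar = sum(t) / len(t)
    sq = [(ti - tbar) ** 2 for ti in t]
    return max(sq) / sum(sq)

for n in (10, 20, 40):
    equal = huber_ratio([1.0 + 0.5 * i for i in range(1, n + 1)])   # t_i = a + i*Delta
    expo = huber_ratio([2.0 ** i for i in range(1, n + 1)])         # t_i = 2^i
    print(n, round(equal, 4), round(expo, 4))
```

For equal spacing the ratio equals 3(n-1)/(n(n+1)) and so tends to 0, while for t_i = 2^i it stays bounded away from 0 (it approaches 3/4), since the last design point dominates the sum.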


Consider the hypothesis 

$$H:\ \theta=\sum_{j=1}^{s}b_j\beta_j=0, \tag{11.50}$$

where the $b$'s are known constants with $\sum_j b_j^2=1$. Assume without loss of 
generality that $A^TA=I$, the identity matrix, so that the columns of $A$ are mutually 
orthogonal and of length one. The least squares estimator of $\theta$ is given by 

$$\hat\theta=\sum_{j=1}^{s}b_j\hat\beta_j=\sum_{i=1}^{n}d_iX_i, \tag{11.51}$$

where by (11.46) 

$$d_i=\sum_{j=1}^{s}a_{i,j}b_j \tag{11.52}$$

(Problem 11.74). By the orthogonality of $A$, $\sum_i d_i^2=\sum_j b_j^2=1$, so that under $H$, 

$$E(\hat\theta)=\sum_{j=1}^{s}E(b_j\hat\beta_j)=\sum_{j=1}^{s}b_j\beta_j=0$$

and 

$$Var(\hat\theta)=Var\Bigl(\sum_{i=1}^{n}d_iX_i\Bigr)=\sigma^2\sum_{i=1}^{n}d_i^2=\sigma^2.$$
Consider the uniformly most powerful invariant test that rejects $H$ when the 
t-statistic satisfies 

$$\frac{|\hat\theta|}{\sqrt{\sum_i(X_i-\hat\xi_i)^2/(n-s)}}>C. \tag{11.53}$$

Now, the denominator of (11.53) tends in probability to $\sigma$. To see why, with $s$ 
fixed, it suffices to show 

$$\frac{1}{n}\sum_i(X_i-\hat\xi_i)^2\stackrel{P}{\to}\sigma^2.$$

But, the left side is 

$$\frac{\sum_i(X_i-\xi_i)^2}{n}+\frac{2\sum_i(X_i-\xi_i)(\xi_i-\hat\xi_i)}{n}+\frac{\sum_i(\xi_i-\hat\xi_i)^2}{n}.$$

The first term tends in probability to $\sigma^2$, by the Weak Law of Large Numbers. By 
the Cauchy-Schwarz Inequality, half the middle term is bounded by the square 
root of the product of the first and third terms. Therefore, it suffices to show the 
third term tends to 0 in probability. Since this term is nonnegative, it suffices to 
show its expectation tends to 0, by Markov's Inequality (Problem 11.26). But its 
expectation is the trace of the covariance matrix of $\hat\xi$ divided by $n$. Letting $I_n$ 
denote the $n\times n$ identity matrix, the covariance matrix of $\hat\xi=PX$ is 

$$P(\sigma^2I_n)P^T=\sigma^2PP^T=\sigma^2P.$$

But, the trace of $P$ is 

$$tr(P)=tr(A(A^TA)^{-1}A^T)=tr(A^TA(A^TA)^{-1})=tr(I_s)=s,$$

since $tr(BC)=tr(CB)$ for any $n\times s$ matrix $B$ and $s\times n$ matrix $C$. The expectation 
of the third term is therefore $\sigma^2s/n$, which tends to 0. Hence, 
the denominator of (11.53) converges in probability to $\sigma$. By Lemma 11.3.3, the 
quantity $\hat\theta$ in the numerator of (11.53) converges in distribution to $N(0,\sigma^2)$ provided 

$$\max_{i=1,\ldots,n}d_i^2\to 0\quad\text{as } n\to\infty. \tag{11.54}$$

Under this condition, the level of the t-test is therefore robust against 
nonnormality. 

So far, $b=(b_1,\ldots,b_s)^T$ has been fixed. To determine when the level of (11.53) 
is robust for all $b$ with $\sum_j b_j^2=1$, it is only necessary to find the maximum value 
of $d_i^2$ as $b$ varies. By the Schwarz inequality 

$$d_i^2=\Bigl(\sum_{j=1}^{s}a_{i,j}b_j\Bigr)^2\le\sum_{j=1}^{s}a_{i,j}^2,$$

with equality holding when $b_j=a_{i,j}/\sqrt{\sum_k a_{i,k}^2}$. The desired maximum of $d_i^2$ is 
therefore $\sum_j a_{i,j}^2$, and 

$$\max_i\sum_{j=1}^{s}a_{i,j}^2\to 0\quad\text{as } n\to\infty \tag{11.55}$$

is a sufficient condition for the asymptotic normality of every $\hat\theta$ of the form (11.51). 
The condition (11.55) depends on the particular parametrization (11.46) 
chosen for $\Pi_\Omega$. Note however that 

$$\sum_{j=1}^{s}a_{i,j}^2=\Pi_{i,i}, \tag{11.56}$$

where $\Pi_{i,j}$ is the $(i,j)$ element of the projection matrix $P$. 

This shows that the value of $\Pi_{i,i}$ is coordinate free, i.e. it is unchanged by an 
arbitrary change of coordinates $\beta^*=B^{-1}\beta$, where $B$ is a nonsingular matrix, 
since 

$$\xi=A\beta=AB\beta^*=A^*\beta^*$$

with $A^*=AB$, and 

$$P^*=AB(B^TA^TAB)^{-1}B^TA^T=ABB^{-1}(A^TA)^{-1}(B^T)^{-1}B^TA^T=P.$$

Hence, (11.55) is equivalent to the coordinate-free Huber condition 

$$\max_i\Pi_{i,i}\to 0\quad\text{as } n\to\infty. \tag{11.57}$$

For evaluating $\Pi_{i,i}$, it is helpful to note that 

$$\hat\xi_i=\sum_{j=1}^{n}\Pi_{i,j}X_j\qquad(i=1,\ldots,n),$$

so that $\Pi_{i,i}$ is simply the coefficient of $X_i$ in $\hat\xi_i$, which must be calculated in any 
case to carry out the test. 

If $\Pi_{i,i}\le M_n$ for all $i=1,\ldots,n$, then also $|\Pi_{i,j}|\le M_n$ for all $i$ and $j$. This follows 
from the fact that $P=PP^T$ by Lemma 11.3.2(i), on applying the 
Cauchy-Schwarz inequality to the $(i,j)$ element of $PP^T$, which gives 
$|\Pi_{i,j}|\le\sqrt{\Pi_{i,i}\Pi_{j,j}}$. Condition (11.57) is 
therefore equivalent to 

$$\max_{i,j}|\Pi_{i,j}|\to 0\quad\text{as } n\to\infty. \tag{11.58}$$


Example 11.3.2 (Example 11.3.1, continued) In Example 11.3.1, the 
coefficient of $X_i$ in $\hat\xi_i=\hat\alpha+\hat\beta t_i$ is 

$$\Pi_{i,i}=\frac{1}{n}+\frac{(t_i-\bar t)^2}{\sum_j(t_j-\bar t)^2}$$

and the Huber condition reduces to the condition (11.49) found earlier. ■ 
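
As a small numerical illustration of Example 11.3.2, the diagonal elements 1/n + (t_i - tbar)^2 / sum_j (t_j - tbar)^2 can be computed directly. The sketch below assumes equally spaced design points t_i = i; the function name `leverages` is a hypothetical label introduced here.

```python
def leverages(t):
    """Diagonal elements Pi_{i,i} = 1/n + (t_i - tbar)^2 / sum_j (t_j - tbar)^2."""
    n = len(t)
    tbar = sum(t) / n
    ss = sum((tj - tbar) ** 2 for tj in t)
    return [1.0 / n + (ti - tbar) ** 2 / ss for ti in t]

for n in (10, 100, 1000):
    lev = leverages([float(i) for i in range(1, n + 1)])
    # the maximum shrinks with n, while the sum stays equal to tr(P) = s = 2
    print(n, round(max(lev), 5), round(sum(lev), 5))
```

The sum of the diagonal elements is always s = 2 here, so the Huber condition asks that this fixed total be spread over the n observations without any single one retaining a non-negligible share.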


Example 11.3.3 (Two-way Layout) Consider the two-way layout with $m$ 
observations per cell and the additive model 

$$\xi_{ijk}=E(X_{ijk})=\mu+\alpha_i+\beta_j$$

with 

$$\sum_i\alpha_i=\sum_j\beta_j=0,$$

$i=1,\ldots,a$; $j=1,\ldots,b$; $k=1,\ldots,m$. It is easily seen (Problem 11.75) that, for 
fixed $a$ and $b$, the Huber condition is satisfied as $m\to\infty$. ■ 
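
One way to see the claim in Example 11.3.3 is sketched below. It assumes the standard least squares fit for the balanced additive layout, in which the fitted value for cell (i, j) is the row mean plus the column mean minus the grand mean, so that the coefficient of an observation in its own fitted value is 1/(bm) + 1/(am) - 1/(abm). The function name is hypothetical.

```python
def two_way_leverage(a, b, m):
    """Coefficient of X_{ijk} in its own fitted value for the balanced additive layout.

    Assumes the fit xi_hat_{ijk} = Xbar_{i..} + Xbar_{.j.} - Xbar_{...},
    which gives 1/(b*m) + 1/(a*m) - 1/(a*b*m) for every observation.
    """
    return 1.0 / (b * m) + 1.0 / (a * m) - 1.0 / (a * b * m)

a, b = 3, 4
for m in (1, 10, 100, 1000):
    lev = two_way_leverage(a, b, m)
    print(m, lev, lev * a * b * m)   # second number is tr(P) = a + b - 1
```

Every diagonal element equals (a + b - 1)/(abm), which tends to 0 as m grows with a and b fixed.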
Let us next generalize the hypothesis (11.50) to hypotheses which impose 
several linear constraints such as (11.50). Without loss of generality, choose the 
parametrization in (11.46) in such a way that the $s$ columns of $A$ are orthogonal 
and of length one and make the transformation 

$$Y=CX$$

(used in (7.1)), where $C$ is orthogonal and the first $s$ rows of $C$ are equal to those 
of $A^T$, say 

$$C=\begin{pmatrix}A^T\\ D\end{pmatrix} \tag{11.59}$$

for some $(n-s)\times n$ matrix $D$. If $\eta_i=E(Y_i)$, we then have that 

$$\eta=\begin{pmatrix}A^T\\ D\end{pmatrix}A\beta=(\beta_1,\ldots,\beta_s,0,\ldots,0)^T. \tag{11.60}$$

By the orthogonality of $C$, the $Y_i$ are independent with $Y_i$ distributed as 
$N(\eta_i,\sigma^2)$, where $\eta_i=\beta_i$ for $i=1,\ldots,s$ and $\eta_i=0$ for $i=s+1,\ldots,n$. We 
want to test 

$$H:\ \sum_{j=1}^{s}\alpha_{i,j}\eta_j=0,\qquad i=1,\ldots,r \tag{11.61}$$

where we shall assume that the $r$ vectors $(\alpha_{i,1},\ldots,\alpha_{i,s})^T$ are orthogonal and of 
length one. Then the variables 

$$Z_i=\begin{cases}\sum_{j=1}^{s}\alpha_{i,j}Y_j & i=1,\ldots,r\\ Y_i & i=s+1,\ldots,n\end{cases} \tag{11.62}$$

are independent $N(\zeta_i,\sigma^2)$ with 

$$\zeta_i=\begin{cases}\sum_{j=1}^{s}\alpha_{i,j}\eta_j & i=1,\ldots,r\\ \eta_i & i=r+1,\ldots,s\\ 0 & i=s+1,\ldots,n.\end{cases}$$

The standard UMPI test of $H:\ \zeta_1=\cdots=\zeta_r=0$ rejects when 

$$\frac{\sum_{i=1}^{r}Z_i^2/r}{\sum_{i=s+1}^{n}Z_i^2/(n-s)}>k \tag{11.63}$$

where $k$ is determined so that the probability of (11.63) is $\alpha$ when the $Z$'s are 
normal and $H$ holds. 

We shall now suppose that the model (11.46) is embedded in a sequence of 
such models defined by matrices $A^{(n)}$, with $s$ fixed and $n\to\infty$. Suppose that 
the $X$'s are not normal but given by 

$$X_i=U_i+\xi_i,$$

where the $U$'s are i.i.d. according to a distribution $F$ with mean 0 and variance 
$\sigma^2<\infty$. We then have the following robustness result. 


Theorem 11.3.1 Let $\alpha_n(F)$ denote the rejection probability of the test (11.63) 
when the $U$'s have distribution $F$ and the null hypothesis constraints are satisfied. 
Then, $\alpha_n(F)\to\alpha$ provided 

$$\max_i\sum_{j=1}^{s}\bigl(a_{i,j}^{(n)}\bigr)^2\to 0 \tag{11.64}$$

or equivalently 

$$\max_i\Pi_{i,i}^{(n)}\to 0,$$

where $\Pi_{i,i}^{(n)}$ is the $i$th diagonal element of $P=A(A^TA)^{-1}A^T$. 


Proof. We must show that the limiting distribution of (11.63) is the same as 
when $F$ is normal. First, we shall show that the denominator of (11.63) satisfies 

$$\frac{1}{n-s}\sum_{i=s+1}^{n}Z_i^2\stackrel{P}{\to}\sigma^2. \tag{11.65}$$

Note that $X=C^TY$ and $Y=QZ$ where $C^T$ and $Q$ are both orthogonal. 
Therefore, 

$$\frac{1}{n-s}\sum_{i=s+1}^{n}Z_i^2=\frac{n}{n-s}\cdot\frac{1}{n}\sum_{i=1}^{n}X_i^2-\frac{1}{n-s}\sum_{i=1}^{s}Z_i^2.$$

To see that this tends to $\sigma^2$ in probability, we first show that 

$$\frac{1}{n}\sum_{i=1}^{n}X_i^2\stackrel{P}{\to}\sigma^2.$$

But, 

$$\frac{\sum_{i=1}^{n}X_i^2}{n}=\frac{\sum_{i=1}^{n}(X_i-\xi_i)^2}{n}+\frac{2\sum_{i=1}^{n}(X_i-\xi_i)\xi_i}{n}+\frac{\sum_{i=1}^{n}\xi_i^2}{n}.$$

The first term on the right tends to $\sigma^2$ in probability, by the Weak Law of Large 
Numbers. By the orthogonality of $C$, the last term is equal to $\sum_{j=1}^{s}\beta_j^2/n$, which 
tends to 0 since $s$ is fixed. It is easily checked that the middle term has a mean 
and variance which tend to 0. Hence, $\sum_i X_i^2/n$ tends in probability to $\sigma^2$. Next, 
we show that 

$$\frac{\sum_{i=1}^{s}Z_i^2}{n}\stackrel{P}{\to}0.$$

It suffices to show 

$$\frac{\sum_{i=1}^{s}E(Z_i^2)}{n}=\frac{\sum_{i=1}^{s}Var(Z_i)}{n}+\frac{\sum_{i=1}^{s}[E(Z_i)]^2}{n}\to 0.$$

Since $s$ is fixed and $Var(Z_i)=\sigma^2$, we only need to show 

$$\frac{\sum_{i=1}^{s}[E(Z_i)]^2}{n}\to 0.$$

For $i\le r$, 

$$E(Z_i)=\sum_{j=1}^{s}\alpha_{i,j}\eta_j=\sum_{j=1}^{s}\alpha_{i,j}\beta_j$$

and 

$$[E(Z_i)]^2\le\sum_{j=1}^{s}\alpha_{i,j}^2\sum_{j=1}^{s}\beta_j^2=\sum_{j=1}^{s}\beta_j^2.$$

For $r+1\le i\le s$, $E(Z_i)=\beta_i$, in which case the same bound holds. Therefore, 

$$\frac{\sum_{i=1}^{s}[E(Z_i)]^2}{n}\le\frac{s\sum_{j=1}^{s}\beta_j^2}{n}\to 0,$$

and the result (11.65) follows. 

Next, we consider the numerator of (11.63). We show the joint asymptotic 
normality of $(Z_1,\ldots,Z_r)$. By the Cramér-Wold device, it suffices to show that, 
for any constants $\gamma_1,\ldots,\gamma_r$ with $\sum_i\gamma_i^2=1$, 

$$\sum_{i=1}^{r}\gamma_iZ_i\stackrel{d}{\to}N(0,\sigma^2).$$

Indeed, since the columns of $A$ are orthogonal, $\hat\beta_i=Y_i$ for $1\le i\le s$ and so 
$Z_i$ is a linear combination of $\hat\beta_1,\ldots,\hat\beta_s$. But then so is $\sum_i\gamma_iZ_i$ and asymptotic 
normality follows from the argument for $\hat\theta$ of the form (11.51). ■ 


Example 11.3.4 (Test of Homogeneity) Let $X_{ij}$ ($j=1,\ldots,n_i$; $i=1,\ldots,s$) 
be independently distributed as $N(\mu_i,\sigma^2)$. The problem is to test the null 
hypothesis 

$$H:\ \mu_1=\cdots=\mu_s.$$

In this case, the test (11.63) is UMP invariant and reduces to 

$$\frac{\sum_i n_i(X_{i\cdot}-X_{\cdot\cdot})^2/(s-1)}{\sum_i\sum_j(X_{ij}-X_{i\cdot})^2/(n-s)}, \tag{11.66}$$

where 

$$X_{i\cdot}=\sum_j X_{ij}/n_i,\qquad X_{\cdot\cdot}=\sum_i\sum_j X_{ij}/n$$

and $n=\sum_i n_i$. Suppose that, instead of $X_{ij}$ being $N(\mu_i,\sigma^2)$, $X_{ij}$ has a 
distribution $F(x-\mu_i)$, where $F$ is an arbitrary distribution with finite variance. 
Then, the theorem implies that, if $\min_i n_i\to\infty$, then the rejection probability 
tends to $\alpha$. In fact, the distributions may even vary within each sample, but it is 
important that the different samples have a common variance or the result fails; 
see Problems 11.85 and 11.86. ■ 
11.4 Nonparametric Mean 

11.4.1 Edgeworth Expansions 

Suppose $X_1,\ldots,X_n$ are i.i.d. with c.d.f. $F$. Let $\mu(F)$ denote the mean of $F$, and 
consider the problem of testing $\mu(F)=0$. As in Section 11.3.1, let $\alpha_n(F)$ denote 
the actual rejection probability of the one-sided t-test under $F$. It was seen that 
the t-test is pointwise consistent in level in the sense that $\alpha_n(F)\to\alpha$ whenever 
$F$ has a finite nonzero variance $\sigma^2(F)$. We shall now examine the rate at which 
the difference $\alpha_n(F)-\alpha$ tends to 0. 

In order to study this problem, we will consider expansions of the distribution 
function of the sample mean, as well as its studentized version. Such expansions 
are known as Edgeworth expansions. Let $\Phi(\cdot)$ denote the standard normal c.d.f. 
and $\varphi(\cdot)$ the standard normal density. Also let 

$$\gamma=\gamma(F)=\frac{E_F[(X_i-\mu(F))^3]}{\sigma^3(F)}$$

and 

$$\kappa=\kappa(F)=\frac{E_F[(X_i-\mu(F))^4]}{\sigma^4(F)}-3.$$

The values $\gamma$ and $\kappa$ are known as the skewness and kurtosis of $F$, respectively. 

Theorem 11.4.1 Assume $E_F(|X_i|^{k+2})<\infty$. Let $\psi_F$ denote the characteristic 
function of $F$, and assume 

$$\limsup_{|s|\to\infty}|\psi_F(s)|<1. \tag{11.67}$$

Then, 

$$P_F\Bigl\{\frac{n^{1/2}[\bar X_n-\mu(F)]}{\sigma(F)}\le x\Bigr\}=\Phi(x)+\sum_{j=1}^{k}n^{-j/2}\varphi(x)p_j(x,F)+r_n(x,F), \tag{11.68}$$

where $r_n(x,F)=o(n^{-k/2})$ and $p_j(x,F)$ is a polynomial in $x$ of degree $3j-1$ 
which depends on $F$ through its first $j+2$ moments. In particular, 

$$p_1(x,F)=-\frac{1}{6}\gamma(x^2-1), \tag{11.69}$$

and 

$$p_2(x,F)=-x\Bigl[\frac{1}{24}\kappa(x^2-3)+\frac{1}{72}\gamma^2(x^4-10x^2+15)\Bigr]. \tag{11.70}$$

Moreover, the expansion holds uniformly in $x$ in the sense that, for fixed $F$, 

$$n^{k/2}\sup_x|r_n(x,F)|\to 0\quad\text{as } n\to\infty.$$

The assumption (11.67) is known as Cramér's condition and can be viewed as a 
smoothness assumption on $F$; it holds, for example, if $F$ is absolutely continuous 
(or more generally is nonsingular) but fails if $F$ is a lattice distribution, i.e. $X_i$ can 
only take on values of the form $a+jb$ for some fixed $a$ and $b$ as $j$ varies through the 
integers. A proof of Theorem 11.4.1 can be found in Feller (1971, Section XVI.4) 
or Bhattacharya and Rao (1976), who also provide formulae for the $p_j(x,F)$ when 
$j>2$. The proofs hinge on expansions of characteristic functions. 

Note that the term of order $n^{-1/2}$ is zero if and only if the underlying skewness 
$\gamma(F)$ is zero. This shows that the dominant error in using a standard normal 
approximation to the distribution of the standardized sample mean is due to 
skewness of the underlying distribution. Expansions such as these hold for many 
classes of statistics and provide more information than a weak convergence result, 
such as that provided by the Central Limit Theorem. As an example, the following 
result provides an Edgeworth expansion for the studentized sample mean. Let 
$S_n^2=\sum_i(X_i-\bar X_n)^2/(n-1)$. 

Theorem 11.4.2 Assume $E_F(|X_i|^{k+2})<\infty$ and that $F$ is absolutely 
continuous.⁸ Then, uniformly in $t$, 

$$P_F\Bigl\{\frac{n^{1/2}[\bar X_n-\mu(F)]}{S_n}\le t\Bigr\}=\Phi(t)+\sum_{j=1}^{k}n^{-j/2}\varphi(t)q_j(t,F)+\bar r_n(t,F), \tag{11.71}$$

where $n^{k/2}\sup_t|\bar r_n(t,F)|\to 0$ and $q_j(t,F)$ is a polynomial which depends on $F$ 
through its first $j+2$ moments. In particular, 

$$q_1(t,F)=\frac{1}{6}\gamma(2t^2+1), \tag{11.72}$$

and 

$$q_2(t,F)=t\Bigl[\frac{1}{12}\kappa(t^2-3)-\frac{1}{18}\gamma^2(t^4+2t^2-3)-\frac{1}{4}(t^2+1)\Bigr]. \tag{11.73}$$

Example 11.4.1 (Expansion for the t-distribution) Suppose $F$ is normal 
$N(\mu,\sigma^2)$. Let $t_n=n^{1/2}(\bar X_n-\mu)/S_n$. Then, $\gamma(F)=\kappa(F)=0$. By Theorem 
11.4.2, 

$$P_F\{t_n\le t\}=\Phi(t)-\frac{1}{4n}(t+t^3)\varphi(t)+o(n^{-1}). \tag{11.74}$$

This result implies a corresponding expansion for the quantiles of the 
t-distribution, known as a Cornish-Fisher expansion. Specifically, let $t=t_{n-1,1-\alpha}$ 
be the $1-\alpha$ quantile of the t-distribution with $n-1$ degrees of freedom. We 
would like to determine $c=c_{1-\alpha}$ such that 

$$t_{n-1,1-\alpha}=z_{1-\alpha}+\frac{c}{n}+o(n^{-1}).$$

When $t=t_{n-1,1-\alpha}$, the left side of (11.74) is $1-\alpha$ and the right side is, by a 
Taylor expansion, 

$$\Phi(z)+\frac{c}{n}\varphi(z)-\frac{1}{4n}(z+z^3)\varphi(z)+o(n^{-1}),$$

where $z=z_{1-\alpha}$. Since $\Phi(z)=1-\alpha$, we must have 

$$\frac{c}{n}\varphi(z)-\frac{1}{4n}(z+z^3)\varphi(z)=o(n^{-1})$$

8 Alternatively, one can assume Ejp(\Xi\ 2 ^ 2 ) < oo and the distribution of (Xi, Xf) 
satisfies the multivariate analogue of Cramer’s condition; see Hall (1992), Chapter 2. 
so that 

$$c=c_{1-\alpha}=\frac{1}{4}z_{1-\alpha}(1+z_{1-\alpha}^2).$$

Therefore, 

$$n(t_{n-1,1-\alpha}-z_{1-\alpha})\to\frac{1}{4}z_{1-\alpha}(1+z_{1-\alpha}^2). \quad■ \tag{11.75}$$
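
The Cornish-Fisher relation of Example 11.4.1 suggests the approximation t_{n-1,1-alpha} ≈ z + z(1 + z^2)/(4n), with z the normal 1 - alpha quantile. The sketch below evaluates it with Python's standard library; the function name `t_quantile_cf` is a hypothetical label, and the approximation is only first order, so small discrepancies from tabulated values are expected.

```python
from statistics import NormalDist

def t_quantile_cf(n, alpha):
    """First-order Cornish-Fisher approximation to the 1 - alpha quantile
    of the t-distribution with n - 1 degrees of freedom."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return z + z * (1.0 + z * z) / (4.0 * n)

approx = t_quantile_cf(30, 0.025)
print(round(approx, 4))   # the tabulated value of t_{29, 0.975} is about 2.045
```

The correction term is positive for upper quantiles, reflecting the heavier tails of the t-distribution relative to the normal.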

In Section 11.3.1, we showed that the t-test has error in rejection probability 
tending to 0 as long as the underlying distribution has a finite nonzero variance. 
We will now make use of Edgeworth expansions in order to determine the orders 
of error in rejection probability for tests of the mean. All tests considered are 
based on the t-statistic $t_n$. In order to study this problem, we consider three 
factors: the one-sided case which rejects for large $t_n$ versus the two-sided case 
which rejects for large $|t_n|$; the use of a normal critical value versus a t critical 
value; and the dependence on $F$, especially whether $\gamma(F)$ is 0 or not. For $j=1,2$, 
let $\alpha^z_{n,j}(F)$ denote the rejection probability under $F$ of the $j$-sided test 
using the normal quantile, and let $\alpha^t_{n,j}(F)$ denote the analogous quantity using 
the appropriate t-quantile. For example, 

$$\alpha^t_{n,2}(F)=P_F\{|t_n|>t_{n-1,1-\alpha/2}\}.$$

We assume $E_F(X_i^4)<\infty$ and that $F$ is absolutely continuous so that we can 
apply the Edgeworth expansions in Theorems 11.4.1 and 11.4.2 with $k=2$. 

The One-sided Case. First, consider the test using the normal quantile. Since 
$\alpha^z_{n,1}(F)=1-P_F\{t_n\le z_{1-\alpha}\}$, (11.71) gives 

$$\alpha^z_{n,1}(F)-\alpha=-n^{-1/2}\varphi(z_{1-\alpha})q_1(z_{1-\alpha},F)-n^{-1}\varphi(z_{1-\alpha})q_2(z_{1-\alpha},F)+o(n^{-1}).$$

It follows that 

$$\alpha^z_{n,1}(F)-\alpha=O(n^{-1/2}).$$

However, if $\gamma(F)=0$, then $q_1(z_{1-\alpha},F)=0$ and so 

$$\alpha^z_{n,1}(F)-\alpha=O(n^{-1})$$

in this case. Using the t-quantiles instead of the normal quantiles yields 

$$\alpha^t_{n,1}(F)-\alpha=\Phi(t_{n-1,\alpha})-\alpha-n^{-1/2}\varphi(t_{n-1,1-\alpha})q_1(t_{n-1,1-\alpha},F)+O(n^{-1}).$$

Then, applying (11.75), $t_{n-1,1-\alpha}-z_{1-\alpha}=O(n^{-1})$, so that a Taylor expansion 
yields 

$$\alpha^t_{n,1}(F)-\alpha=-n^{-1/2}\varphi(z_{1-\alpha})q_1(z_{1-\alpha},F)+O(n^{-1}).$$

Therefore, 

$$\alpha^t_{n,1}(F)-\alpha=O(n^{-1/2}),$$

but the error in rejection probability is $O(n^{-1})$ if $\gamma(F)=0$. 

The Two-sided Case. Let $z=z_{1-\alpha/2}$. Then, using the fact that $\varphi(z)=\varphi(-z)$, 

$$\alpha^z_{n,2}(F)=P_F\{|t_n|>z\}=1-[P_F\{t_n\le z\}-P_F\{t_n<-z\}]$$

$$=\alpha-n^{-1/2}\varphi(z)[q_1(z,F)-q_1(-z,F)]+O(n^{-1}).$$

But, $q_1(\cdot,F)$ is an even function, which implies 

$$\alpha^z_{n,2}(F)-\alpha=O(n^{-1}),$$

even if $\gamma(F)$ is not zero. Similarly, it can be shown that (Problem 11.90) 

$$\alpha^t_{n,2}(F)-\alpha=O(n^{-1}). \tag{11.76}$$
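
The role of skewness in the one-sided and two-sided cases can be illustrated numerically. The sketch below takes the polynomial q_1(t, F) = gamma (2t^2 + 1)/6 of Theorem 11.4.2 and evaluates the magnitude n^{-1/2} phi(z) |q_1(z, F)| of the leading term; the choice gamma = 2 (the skewness of the exponential distribution) and the function names are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def q1(t, gamma):
    """Polynomial q_1(t, F) = gamma * (2 t^2 + 1) / 6."""
    return gamma * (2.0 * t * t + 1.0) / 6.0

def leading_term(n, z, gamma):
    """Magnitude of the n^{-1/2} term: n^{-1/2} * phi(z) * |q_1(z, F)|."""
    return NormalDist().pdf(z) * abs(q1(z, gamma)) / sqrt(n)

gamma = 2.0                            # skewness of the exponential distribution
z_one = NormalDist().inv_cdf(0.95)     # one-sided critical value, alpha = 0.05
z_two = NormalDist().inv_cdf(0.975)    # two-sided critical value, alpha = 0.05
for n in (25, 100, 400):
    print(n, round(leading_term(n, z_one, gamma), 4))
print(q1(z_two, gamma) - q1(-z_two, gamma))   # 0.0: the n^{-1/2} terms cancel
```

Quadrupling n only halves the one-sided leading term, whereas in the two-sided case the evenness of q_1 removes the n^{-1/2} contribution altogether.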


11.4.2 The t-test 

It was seen in Section 11.3.1 that the classical t-test of the mean is 
asymptotically pointwise consistent in level for the class $\mathbf{F}$ of all distributions with finite 
nonzero variance. In Section 11.4.1, the orders of error in rejection probability 
were obtained for a given $F$. However, these results are not reassuring unless the 
convergence is uniform in $F$. If it is not, then for any $n$, no matter how large, there 
will exist $F$ in $\mathbf{F}$ for which the rejection probability under $F$, $\alpha_n(F)$, is not even 
close to $\alpha$. We shall show below that the convergence is not uniform and that the 
situation is even worse than what this negative result suggests. Namely, we shall 
show that for any $n$, there exist distributions $F$ for which $\alpha_n(F)$ is arbitrarily 
close to 1; that is, the size of the t-test is 1. 

Suppose $X_1,\ldots,X_n$ are i.i.d. real-valued random variables with unknown c.d.f. 
$F\in\mathbf{F}$, where $\mathbf{F}$ is a large nonparametric class of distributions. Let $\mu(F)$ denote 
the mean of $F$ and $\sigma^2(F)$ the variance of $F$. The goal is to test the null hypothesis 
$\mu(F)=0$ versus $\mu(F)>0$, or perhaps the two-sided alternative $\mu(F)\ne 0$. 

Theorem 11.4.3 For every $n$, the size of the t-test is 1 for the family $\mathbf{F}_0$ of all 
distributions with finite variance. 

Proof. Let $c$ be an arbitrary positive constant less than one and let $p_n=1-c^{1/n}$ 
so that $(1-p_n)^n=c$. Let $F=F_{n,c}$ be the distribution that places mass $1-p_n$ 
at $p_n$ and mass $p_n$ at $p_n-1$, so that $\mu(F)=0$. With probability $c$, we have 
all observations equal to $p_n$. For such a sample, the numerator $n^{1/2}\bar X_n$ of the 
t-statistic is $n^{1/2}p_n>0$ while the denominator is 0. Thus, the t-statistic blows 
up and the hypothesis will be rejected. The probability of rejection is therefore 
$\ge c$, and by taking $c$ arbitrarily close to 1 the theorem is proved. (Note that one 
can modify the distributions $F_{n,c}$ used in the proof to be continuous rather than 
discrete.) ■ 
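
The two-point construction in the proof of Theorem 11.4.3 is easy to check numerically. The following sketch, with the hypothetical helper `two_point` and the illustrative values n = 20 and c = 0.99, confirms that the distribution has mean 0 and that all observations coincide with probability c.

```python
def two_point(n, c):
    """Return (p_n, support, masses) for the two-point distribution F_{n,c}."""
    p = 1.0 - c ** (1.0 / n)
    support = (p, p - 1.0)        # values taken by an observation
    masses = (1.0 - p, p)         # corresponding probabilities
    return p, support, masses

n, c = 20, 0.99
p, support, masses = two_point(n, c)
mean = sum(x * w for x, w in zip(support, masses))
prob_all_equal = masses[0] ** n   # probability that every observation equals p_n
print(p, mean, prob_all_equal)
```

On the event that all observations equal p_n the sample variance is 0 while the sample mean is positive, so a one-sided t-test rejects on an event of probability c.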

It follows that the t-test is not even uniformly asymptotically level $\alpha$ for the 
family $\mathbf{F}_0$. 

Instead of $\mathbf{F}_0$, one may wish to consider the behavior of the t-test against 
other nonparametric families. If $\mathbf{F}_2$ is the family of all symmetric distributions 
with finite variance, it turns out that the t-test is still not uniformly level $\alpha$, and 
this is true even if the symmetric distributions have their support on $(-1,1)$ or 
any other fixed compact set; see Romano (2004). In fact, the size of the t-test 
under symmetry is one for moderate values of $\alpha$; see Basu and DasGupta (1995). 
However, it can be shown that the size of the t-test is bounded away from 1 for 
small values of $\alpha$, by a result of Edelman (1990). Basu and DasGupta (1995) also 
show that if $\mathbf{F}_3$ is the family of all symmetric unimodal distributions (with no 
moment restrictions), then the largest rejection probability under $F$ of the t-test 
occurs when $F$ is uniform on $[-1,1]$, at least in the case of very small $\alpha$. 

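On the other hand, we will now show that the t-test is uniformly consistent 
over certain large subfamilies of distributions with two finite moments. For this 
purpose, consider a family of distributions $\tilde{\mathbf{F}}$ on the real line satisfying 

$$\lim_{\lambda\to\infty}\sup_{F\in\tilde{\mathbf{F}}}E_F\Bigl[\frac{|X-\mu(F)|^2}{\sigma^2(F)}I\Bigl\{\frac{|X-\mu(F)|}{\sigma(F)}>\lambda\Bigr\}\Bigr]=0. \tag{11.77}$$

For example, for any $\epsilon>0$ and $b>0$, let $\tilde{\mathbf{F}}^{2+\epsilon}$ be the set of distributions 
satisfying 

$$E_F\Bigl[\frac{|X-\mu(F)|^{2+\epsilon}}{\sigma^{2+\epsilon}(F)}\Bigr]\le b.$$

Then, $\tilde{\mathbf{F}}^{2+\epsilon}$ satisfies (11.77). To see why, take expectations of both sides of 
the inequality 

$$\lambda^{\epsilon}|Y|^2I\{|Y|>\lambda\}\le|Y|^{2+\epsilon}$$

with $Y=[X-\mu(F)]/\sigma(F)$, which shows that the expectation in (11.77) is at most 
$b\lambda^{-\epsilon}$ for every $F$ in the family. 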

Lemma 11.4.1 Suppose $X_{n,1},\ldots,X_{n,n}$ are i.i.d. $F_n$ with $F_n\in\tilde{\mathbf{F}}$, where $\tilde{\mathbf{F}}$ 
satisfies (11.77). Let $\bar X_n=\sum_{i=1}^{n}X_{n,i}/n$. Then, under $F_n$, 

$$\frac{n^{1/2}[\bar X_n-\mu(F_n)]}{\sigma(F_n)}\stackrel{d}{\to}N(0,1).$$

Proof. Let $Y_{n,i}=[X_{n,i}-\mu(F_n)]/\sigma(F_n)$. We verify the Lindeberg Condition 
(11.11), which in the case of $n$ i.i.d. variables reduces to showing 

$$\limsup_n E[Y_{n,i}^2I\{|Y_{n,i}|>\epsilon n^{1/2}\}]=0$$

for every $\epsilon>0$. But, for every $\lambda>0$, 

$$\limsup_n E[Y_{n,i}^2I\{|Y_{n,i}|>\epsilon n^{1/2}\}]\le\limsup_n E[Y_{n,i}^2I\{|Y_{n,i}|>\lambda\}].$$

Let $\lambda\to\infty$ and the right side tends to zero. ■ 


Lemma 11.4.2 Let $Y_{n,1},\ldots,Y_{n,n}$ be i.i.d. with c.d.f. $G_n$ and finite mean $\mu(G_n)$ 
satisfying 

$$\lim_{\beta\to\infty}\limsup_{n\to\infty}E_{G_n}[|Y_{n,i}-\mu(G_n)|I\{|Y_{n,i}-\mu(G_n)|\ge\beta\}]=0. \tag{11.78}$$

Let $\bar Y_n=\sum_{i=1}^{n}Y_{n,i}/n$. Then, under $G_n$, $\bar Y_n-\mu(G_n)\to 0$ in probability. 

Proof. Without loss of generality, assume $\mu(G_n)=0$. Define 

$$Z_{n,i}=Y_{n,i}I\{|Y_{n,i}|\le n\}.$$

Let $m_n=E(Z_{n,i})$ and $\bar Z_n=\sum_{i=1}^{n}Z_{n,i}/n$. Then, the event $\{|\bar Y_n-m_n|>\epsilon\}$ 
implies either $\{|\bar Z_n-m_n|>\epsilon\}$ occurs or $\{\bar Y_n\ne\bar Z_n\}$ occurs. Hence, for any $\epsilon>0$, 

$$P\{|\bar Y_n-m_n|>\epsilon\}\le P\{|\bar Z_n-m_n|>\epsilon\}+P\{\bar Y_n\ne\bar Z_n\}. \tag{11.79}$$

The last term is bounded above by 

$$P\Bigl\{\bigcup_{i=1}^{n}\{Y_{n,i}\ne Z_{n,i}\}\Bigr\}\le\sum_{i=1}^{n}P\{Y_{n,i}\ne Z_{n,i}\}=nP\{|Y_{n,1}|>n\}.$$





The first term on the right side of (11.79) can be bounded by Chebyshev's 
inequality, so that 

$$P\{|\bar Y_n-m_n|>\epsilon\}\le(n\epsilon^2)^{-1}E(Z_{n,1}^2)+nP\{|Y_{n,1}|>n\}. \tag{11.80}$$

For $t>0$, let 

$$\tau_n(t)=t[1-G_n(t)+G_n(-t)]$$

and 

$$\kappa_n(t)=\frac{1}{t}\int_{-t}^{t}x^2\,dG_n(x)=-\tau_n(t)+\frac{2}{t}\int_0^t\tau_n(x)\,dx; \tag{11.81}$$

the last equality follows by integration by parts (Problem 11.96) and corrects 
(7.7), p.235 of Feller (1971). Hence, 

$$P\{|\bar Y_n-m_n|>\epsilon\}\le\epsilon^{-2}\kappa_n(n)+\tau_n(n). \tag{11.82}$$

But, for any $t>0$, 

$$\tau_n(t)\le E[|Y_{n,1}|I\{|Y_{n,1}|\ge t\}],$$

so $\tau_n(n)\to 0$ by (11.78). Fix any $\delta>0$ and let $\beta_0$ be such that 

$$\limsup_n E[|Y_{n,1}|I\{|Y_{n,1}|\ge\beta_0\}]<\frac{\delta}{4}.$$

Then, there is an $n_0$ such that, for all $n\ge n_0$, 

$$E[|Y_{n,1}|I\{|Y_{n,1}|\ge\beta_0\}]<\frac{\delta}{2},$$

and so 

$$E|Y_{n,1}|\le\beta_0+\frac{\delta}{2}$$

for all $n\ge n_0$ as well. Then, if $n\ge n_0>\beta_0$, 

$$\frac{1}{n}\int_0^n\tau_n(x)\,dx\le\frac{1}{n}\int_0^n E[|Y_{n,1}|I\{|Y_{n,1}|\ge x\}]\,dx$$

$$\le\frac{1}{n}\int_0^{\beta_0}E|Y_{n,1}|\,dx+\frac{1}{n}\int_{\beta_0}^{n}\frac{\delta}{2}\,dx\le\frac{\beta_0(\beta_0+\delta/2)}{n}+\frac{\delta}{2},$$

which is less than $\delta$ for all sufficiently large $n$. Thus, $\kappa_n(n)\to 0$ as $n\to\infty$ and 
so (11.82) tends to 0 as well. Therefore, $\bar Y_n-m_n\to 0$ in probability. Finally, 
$m_n\to 0$; to see why, observe 

$$0=E(Y_{n,i})=m_n+E[Y_{n,i}I\{|Y_{n,i}|>n\}],$$

so that 

$$|m_n|\le E[|Y_{n,1}|I\{|Y_{n,1}|>n\}]\to 0,$$

by assumption (11.78). ■ 


Lemma 11.4.3 Let $\tilde{\mathbf{F}}$ be a family of distributions satisfying (11.77). Suppose 
$X_{n,1},\ldots,X_{n,n}$ are i.i.d. $F_n\in\tilde{\mathbf{F}}$ and $\mu(F_n)=0$. Then, under $F_n$, 

$$\frac{1}{n}\sum_{i=1}^{n}\frac{X_{n,i}^2}{\sigma^2(F_n)}\to 1\quad\text{in probability.}$$

Proof. Apply Lemma 11.4.2 to $Y_{n,i}=[X_{n,i}^2/\sigma^2(F_n)]-1$. To see that Lemma 
11.4.2 applies, note that if $\beta>1$, then the event $\{|Y_{n,i}|\ge\beta\}$ implies 
$X_{n,i}^2/\sigma^2(F_n)\ge\beta+1$ (since $X_{n,i}^2/\sigma^2(F_n)\ge 0$) and also $|Y_{n,i}|\le X_{n,i}^2/\sigma^2(F_n)$. 
Hence, for $\beta>1$, 

$$E[|Y_{n,i}|I\{|Y_{n,i}|\ge\beta\}]\le E\Bigl[\frac{X_{n,i}^2}{\sigma^2(F_n)}I\Bigl\{\frac{|X_{n,i}|}{\sigma(F_n)}\ge\sqrt{\beta+1}\Bigr\}\Bigr].$$

The sup over $n$ then tends to 0 as $\beta\to\infty$ by the assumption $F_n\in\tilde{\mathbf{F}}$. ■ 


We are now in a position to study the behavior of the t-test uniformly across 
a fairly large class of distributions. 


Theorem 11.4.4 Let $F_n\in\tilde{\mathbf{F}}$, where $\tilde{\mathbf{F}}$ satisfies (11.77). Assume 

$$n^{1/2}\mu(F_n)/\sigma(F_n)\to\delta\quad\text{as } n\to\infty$$

(where $|\delta|$ is allowed to be $\infty$). Let $X_1,\ldots,X_n$ be i.i.d. with c.d.f. $F_n$, and consider 
the t-statistic 

$$t_n=n^{1/2}\bar X_n/S_n,$$

where $\bar X_n$ is the sample mean and $S_n^2$ is the sample variance. If $|\delta|<\infty$, then 
under $F_n$, 

$$t_n\stackrel{d}{\to}N(\delta,1).$$

If $\delta=\infty$ (respectively, $-\infty$), then $t_n\to\infty$ (respectively, $-\infty$) in probability 
under $F_n$. 


Proof. Write 

$$t_n=\frac{n^{1/2}[\bar X_n-\mu(F_n)]}{S_n}+\frac{n^{1/2}\mu(F_n)/\sigma(F_n)}{S_n/\sigma(F_n)}.$$

The proof will follow if we show $S_n/\sigma(F_n)\to 1$ in probability under $F_n$ and if 

$$\frac{n^{1/2}[\bar X_n-\mu(F_n)]}{\sigma(F_n)}\stackrel{d}{\to}N(0,1). \tag{11.83}$$

But the latter follows by Lemma 11.4.1. To show $S_n^2/\sigma^2(F_n)\to 1$ in probability, 
use Lemma 11.4.3 (Problem 11.93). ■ 


Theorem 11.4.4 now allows us to deduce that the t-test is uniformly consistent 
in level, and it also yields a limiting power calculation. 


Theorem 11.4.5 Let $\tilde{\mathbf{F}}$ satisfy (11.77) and let $\tilde{\mathbf{F}}_0$ be the set of $F$ in $\tilde{\mathbf{F}}$ with 
$\mu(F)=0$. For testing $\mu(F)=0$ versus $\mu(F)>0$, the t-test that rejects when 
$t_n>z_{1-\alpha}$ (or $t_{n-1,1-\alpha}$) is uniformly asymptotically level $\alpha$ over $\tilde{\mathbf{F}}_0$; that is, 

$$\Bigl|\sup_{F\in\tilde{\mathbf{F}}_0}P_F\{t_n>z_{1-\alpha}\}-\alpha\Bigr|\to 0 \tag{11.84}$$

as $n\to\infty$. Also, the limiting power against $F_n\in\tilde{\mathbf{F}}$ with $n^{1/2}\mu(F_n)/\sigma(F_n)\to\delta$ 
is given by 

$$\lim_n P_{F_n}\{t_n>z_{1-\alpha}\}=1-\Phi(z_{1-\alpha}-\delta). \tag{11.85}$$

Furthermore, 

$$\inf_{\{F\in\tilde{\mathbf{F}}:\ n^{1/2}\mu(F)/\sigma(F)\ge\delta\}}P_F\{t_n>z_{1-\alpha}\}\to 1-\Phi(z_{1-\alpha}-\delta). \tag{11.86}$$

Proof. To prove (11.84), if the result failed, one could extract a subsequence 
$\{n_j\}$ and distributions $F_{n_j}\in\tilde{\mathbf{F}}_0$ such that 

$$P_{F_{n_j}}\{t_{n_j}>z_{1-\alpha}\}\to\beta\ne\alpha.$$

But this contradicts Theorem 11.4.4 since $t_n$ is asymptotically standard normal 
under $F_n$. The proof of (11.85) follows from Theorem 11.4.4 as well. To prove 
(11.86), again argue by contradiction and assume there exists a subsequence $\{F_n\}$ 
with $n^{1/2}\mu(F_n)/\sigma(F_n)\ge\delta$ such that 

$$P_{F_n}\{t_n>z_{1-\alpha}\}\to\gamma<1-\Phi(z_{1-\alpha}-\delta).$$

The result follows from (11.85) if $n^{1/2}\mu(F_n)/\sigma(F_n)$ has a limit; otherwise, pass 
to any convergent subsequence and apply the same argument. ■ 
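
The limiting power 1 - Phi(z_{1-alpha} - delta) of the one-sided t-test in Theorem 11.4.5 is easy to tabulate. The following sketch uses only the standard library; the function name `limiting_power` and the grid of delta values are illustrative assumptions.

```python
from statistics import NormalDist

def limiting_power(delta, alpha=0.05):
    """Limiting power 1 - Phi(z_{1-alpha} - delta) of the one-sided t-test."""
    std = NormalDist()
    return 1.0 - std.cdf(std.inv_cdf(1.0 - alpha) - delta)

for delta in (0.0, 1.0, 2.0, 3.0):
    print(delta, round(limiting_power(delta), 4))
```

At delta = 0 the expression reduces to the level alpha, and it increases to 1 as delta grows, in line with the second part of Theorem 11.4.4.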

Note that (11.86) does not hold if $\tilde{\mathbf{F}}$ is replaced by all distributions with 
finite second moments or finite fourth moments, or even the more restricted 
family of distributions supported on a compact set. In fact, there exists a 
sequence of distributions $\{F_n\}$ supported on a fixed compact set and satisfying 
$n^{1/2}\mu(F_n)/\sigma(F_n)\ge\delta$ such that the limiting power of the t-test against this 
sequence of alternatives is $\alpha$; see Problem 11.97 for a construction. Nevertheless, 
the t-test behaves well for typical distributions, as demonstrated in Theorem 
11.4.5. However, it is important to realize the t-test does not behave uniformly 
well across distributions with large skewness, as the limiting normal theory fails. 


11.4.3 A Result of Bahadur and Savage 

The negative results for the t-test under the families of all distributions with 
finite variance, or even the family of symmetric distributions with infinitely many 
moments, are perhaps unexpected in view of the fact that the t-test is pointwise 
consistent in level for any distribution with finite (nonzero) variance, but they 
should not really be surprising. After all, the t-test was designed for the family of 
normal distributions and not for nonparametric families. This raises the question 
whether there do exist more satisfactory tests of the mean for nonparametric 
families. 

For the family of distributions with finite variance and for some related families, 
this question was answered by Bahadur and Savage (1956). The desired results 
follow from the following basic lemma. 


Lemma 11.4.4 Let $\mathbf{F}$ be a family of distributions on $\mathbb{R}$ satisfying: 

(i) For every $F\in\mathbf{F}$, $\mu(F)$ exists and is finite. 

(ii) For every real $m$, there is an $F\in\mathbf{F}$ with $\mu(F)=m$. 

(iii) The family $\mathbf{F}$ is convex in the sense that, if $F_i\in\mathbf{F}$ and $\gamma\in[0,1]$, then 
$\gamma F_1+(1-\gamma)F_2\in\mathbf{F}$. 

Let $X_1,\ldots,X_n$ be i.i.d. $F\in\mathbf{F}$ and let $\phi_n=\phi_n(X_1,\ldots,X_n)$ be any test function. 
Let $\mathbf{G}_m$ denote the set of distributions $F\in\mathbf{F}$ with $\mu(F)=m$. Then, 

$$\inf_{F\in\mathbf{G}_m}E_F(\phi_n)\quad\text{and}\quad\sup_{F\in\mathbf{G}_m}E_F(\phi_n)$$

are independent of $m$. 


Proof. To show the result for the sup, fix $m_0$ and let $F_j\in\mathbf{G}_{m_0}$ be such that 

$$\lim_j E_{F_j}(\phi_n)=\sup_{F\in\mathbf{G}_{m_0}}E_F(\phi_n)\equiv s.$$

Fix $m_1$. The goal is to show 

$$\sup_{F\in\mathbf{G}_{m_1}}E_F(\phi_n)=s.$$

Let $H_j$ be a distribution in $\mathbf{F}$ with mean $h_j$ satisfying 

$$m_1=\Bigl(1-\frac{1}{j}\Bigr)m_0+\frac{1}{j}h_j$$

and define 

$$G_j=\Bigl(1-\frac{1}{j}\Bigr)F_j+\frac{1}{j}H_j.$$

Thus, $G_j\in\mathbf{G}_{m_1}$. An observation from $G_j$ can be obtained through a two-stage 
procedure. First, a coin is flipped with probability of heads $1/j$. If the outcome is 
a head, then the observation has the distribution $H_j$; otherwise, the observation 
is from $F_j$. So, with probability $[1-(1/j)]^n$, a sample of size $n$ from $G_j$ is just a 
sample from $F_j$. Then, 

$$\sup_{G\in\mathbf{G}_{m_1}}E_G(\phi_n)\ge E_{G_j}(\phi_n)\ge\Bigl(1-\frac{1}{j}\Bigr)^nE_{F_j}(\phi_n)\to s$$

as $j\to\infty$. Thus, 

$$\sup_{G\in\mathbf{G}_{m_1}}E_G(\phi_n)\ge\sup_{G\in\mathbf{G}_{m_0}}E_G(\phi_n).$$

Interchanging the roles of $m_0$ and $m_1$ and applying the same argument makes 
the last inequality an equality. The result for the inf can be obtained by applying 
the argument to $1-\phi_n$. ■ 
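
The mixture argument in the proof of Lemma 11.4.4 rests on two elementary calculations: the mean h_j that H_j must have, and the probability that a sample of size n from G_j contains no draw from H_j. A minimal sketch follows; the function name and the numerical values of m_0, m_1 and n are illustrative assumptions.

```python
def mixture_parameters(m0, m1, j, n):
    """Mean h_j required of H_j, and the chance that all n draws from G_j come from F_j."""
    h_j = m0 + j * (m1 - m0)             # solves m1 = (1 - 1/j) * m0 + (1/j) * h_j
    prob_all_from_F = (1.0 - 1.0 / j) ** n
    return h_j, prob_all_from_F

m0, m1, n = 0.0, 5.0, 50
for j in (10, 1000, 100000):
    h_j, prob = mixture_parameters(m0, m1, j, n)
    mixture_mean = (1.0 - 1.0 / j) * m0 + h_j / j
    print(j, h_j, round(mixture_mean, 10), round(prob, 4))
```

As j grows, the contaminating mean h_j must move farther out, but the contamination is so rare that a sample of fixed size n is, with probability tending to 1, indistinguishable from a sample from F_j.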


Theorem 11.4.6 Let $\mathbf{F}$ satisfy (i)-(iii) of Lemma 11.4.4. 

(i) Any test of $H:\ \mu(F)=0$ which has size $\alpha$ for the family $\mathbf{F}$ has power $\le\alpha$ 
for any alternative $F$ in $\mathbf{F}$. 

(ii) Any test of $H:\ \mu(F)=0$ which has power $\beta$ against some alternative $F$ in 
$\mathbf{F}$ has size $\ge\beta$. 

Among the families satisfying (i)-(iii) of Lemma 11.4.4 is the family $\mathbf{F}_0$ of 
distributions with finite second moment and that with infinitely many moments. 
Part (ii) of the above theorem provides an alternative proof of Theorem 11.4.3 
since the power of the t-test against the normal alternatives $N(\mu,1)$ tends to 1 as 
$\mu\to\infty$. Theorem 11.4.6 now shows that the failure of the t-test for the family of 
all distributions with finite variance is not the fault of the t-test; in this setting, 
there exists no reasonable test of the mean. The reason is that slight changes in 
the tails of the distribution can result in enormous changes in the mean. 


11.4-4 Alternative Tests 

Another family satisfying conditions (i)-(iii) of Theorem 11.4.6 is the family of all
distributions with compact support. However, the family of all distributions on a
fixed compact set is excluded because it does not satisfy Condition (ii). In fact,
the following construction, due to Anderson (1967), shows that reasonable tests
of the mean do exist if we assume the family of distributions is supported on a
specified compact set. Specifically, let $\mathbf{G}$ be the family of distributions supported
on $[-1, 1]$, and let $\mathbf{G}_0$ be the set of distributions on $[-1, 1]$ having mean 0. We
will exhibit a test that has size $\alpha$ for any fixed sample size $n$ and all $F \in \mathbf{G}_0$, and
is pointwise consistent in power. First, recall the Kolmogorov-Smirnov confidence
band $R_{n,1-\alpha}$ given by (11.36). This leads to a conservative confidence interval
$I_{n,1-\alpha}$ for $\mu(F)$ as follows. Include the value $\mu$ in $I_{n,1-\alpha}$ if and only if there exists
some $G$ in $R_{n,1-\alpha}$ with $\mu(G) = \mu$. Then,

$$\{F \in R_{n,1-\alpha}\} \subseteq \{\mu(F) \in I_{n,1-\alpha}\}$$

and so

$$P_F\{\mu(F) \in I_{n,1-\alpha}\} \ge P_F\{F \in R_{n,1-\alpha}\} \ge 1 - \alpha ,$$

where the last inequality follows by construction of the Kolmogorov-Smirnov
confidence bands. Finally, for testing $\mu(F) = 0$ versus $\mu(F) \ne 0$, let $\phi_n$ be the
test that accepts the null hypothesis if and only if the value 0 falls in $I_{n,1-\alpha}$. By
construction,

$$\sup_{F \in \mathbf{G}_0} E_F(\phi_n) \le \alpha .$$

We claim that

$$I_{n,1-\alpha} \subseteq \bar X_n \pm 2 n^{-1/2} s_{n,1-\alpha} , \qquad (11.87)$$

where $s_{n,1-\alpha}$ is the $1 - \alpha$ quantile of the null distribution of the Kolmogorov-
Smirnov test statistic. The result (11.87) follows from the following lemma.

Lemma 11.4.5 Suppose $F$ and $G$ are distributions on $[-1, 1]$ with

$$\sup_t |F(t) - G(t)| \le \epsilon .$$

Then, $|\mu(F) - \mu(G)| \le 2\epsilon$.

For a proof, see Problem 11.94. The result (11.87) now follows by applying the
lemma to $F$ and the empirical cdf $\hat F_n$.
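As a sanity check on Lemma 11.4.5 (not a proof), the bound can be verified numerically on pairs of empirical distributions supported on $[-1, 1]$. The sketch below is illustrative only: the helper names are hypothetical, and the two distributions are taken to be the empirical cdfs of two simulated samples.

```python
import random


def ks_distance(xs, ys):
    """sup_t |F(t) - G(t)| for the empirical cdfs of xs and ys.

    Both cdfs are right-continuous step functions, so the supremum is
    attained at one of the observed points.
    """
    points = sorted(set(xs) | set(ys))

    def ecdf(sample, t):
        return sum(1 for v in sample if v <= t) / len(sample)

    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in points)


def mean_gap(xs, ys):
    """|mu(F) - mu(G)| for the two empirical distributions."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))


def lemma_holds(xs, ys, tol=1e-12):
    """Check |mu(F) - mu(G)| <= 2 * sup_t |F(t) - G(t)| up to rounding."""
    return mean_gap(xs, ys) <= 2.0 * ks_distance(xs, ys) + tol


if __name__ == "__main__":
    rng = random.Random(0)
    checks = []
    for _ in range(200):
        xs = [rng.uniform(-1, 1) for _ in range(rng.randint(1, 30))]
        ys = [rng.uniform(-1, 1) ** 3 for _ in range(rng.randint(1, 30))]
        checks.append(lemma_holds(xs, ys))
    print(all(checks))
```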

Let $F$ be a distribution with mean $\mu(F) \ne 0$. Suppose without loss of generality
that $\mu(F) > 0$. Also, let $L_{n,1-\alpha}$ be the lower endpoint of the interval $I_{n,1-\alpha}$. Then,

$$E_F(\phi_n) \ge P_F\{L_{n,1-\alpha} > 0\} \ge P_F\{\bar X_n > 2 n^{-1/2} s_{n,1-\alpha}\} \to 1 , \qquad (11.88)$$

by Slutsky's theorem, since $\bar X_n \to \mu(F) > 0$ in probability and $n^{-1/2} s_{n,1-\alpha} \to 0$.
Thus, the test is pointwise consistent in power against any distribution in $\mathbf{G}$
having nonzero mean. In fact, if $\{F_n\}$ is such that $|n^{1/2} \mu(F_n)| \to \infty$, then the
limiting power against such a sequence is one (Problem 11.95). ■
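The outer interval in (11.87) is easy to compute once a value for $s_{n,1-\alpha}$ is available. The sketch below is a simplified, more conservative variant of Anderson's test rather than the exact construction: it rejects when 0 lies outside the outer interval $\bar X_n \pm 2 n^{-1/2} s$, and it replaces the exact quantile $s_{n,1-\alpha}$ by the upper bound $\sqrt{\log(2/\alpha)/2}$ obtained from the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant. That substitution and all function names are assumptions made for illustration; they are not taken from the text.

```python
import math


def dkw_quantile_bound(alpha: float) -> float:
    """Upper bound on the 1 - alpha quantile of sqrt(n) * sup|F_n - F| (DKW-Massart)."""
    return math.sqrt(math.log(2.0 / alpha) / 2.0)


def outer_interval(xs, alpha: float = 0.05):
    """Interval xbar +/- 2 n^{-1/2} s, clipped to [-1, 1], as in (11.87)."""
    if any(abs(x) > 1 for x in xs):
        raise ValueError("observations must lie in [-1, 1]")
    n = len(xs)
    xbar = sum(xs) / n
    half = 2.0 * dkw_quantile_bound(alpha) / math.sqrt(n)
    return max(-1.0, xbar - half), min(1.0, xbar + half)


def reject_mean_zero(xs, alpha: float = 0.05) -> bool:
    """Reject mu(F) = 0 when 0 falls outside the outer interval."""
    lo, hi = outer_interval(xs, alpha)
    return not (lo <= 0.0 <= hi)


if __name__ == "__main__":
    print(outer_interval([0.5] * 100))
    print(reject_mean_zero([0.5] * 100), reject_mean_zero([0.0] * 100))
```

Because the outer interval contains $I_{n,1-\alpha}$ and the bound is at least $s_{n,1-\alpha}$, this variant rejects no more often than Anderson's test, so its size is still at most $\alpha$; the price is some loss of power.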

While Anderson's method controls the level and is pointwise consistent in
power, it is not efficient; an efficient test construction which is of exact level $\alpha$
can be based on the confidence interval construction of Romano and Wolf (2000).

Let us next consider the family of symmetric distributions. Here the mean
coincides with the center of symmetry, and reasonable level $\alpha$ tests for this center
exist. They can, for example, be based on the signed ranks. The one-sample
Wilcoxon test is an example. A large family of randomization tests that control
the level is discussed in Section 15.2.

Finally, we mention a quite different approach to the problem considered in this
section concerning the validity of the t-test in a nonparametric setting. Originally,
the t-test was derived for testing the mean, $\mu$, on the basis of a sample $X_1, \ldots, X_n$
from $N(\mu, \sigma^2)$. But $\mu$ is not only the mean of the normal distribution; it is
also, for example, its median. Instead of embedding the normal family in the
family of all distributions with finite mean (and perhaps finite variance), we
could obtain a different viewpoint by embedding it in the family of all continuous
distributions $F$, and then test the hypothesis that the median of $F$ is 0. A suitable
test is then the sign test.
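For concreteness, here is a minimal sketch of a two-sided exact sign test of the hypothesis that the median of a continuous $F$ is 0. The function name is hypothetical, and dropping observations equal to 0 is a convention adopted here (under a continuous $F$ such ties have probability zero).

```python
from math import comb


def sign_test_pvalue(xs) -> float:
    """Two-sided exact sign test of 'median = 0'.

    Under the null hypothesis the number of positive observations is
    Binomial(n, 1/2), where n counts the nonzero observations.
    """
    nonzero = [x for x in xs if x != 0]
    n = len(nonzero)
    k = sum(1 for x in nonzero if x > 0)
    # Probability of a count at least as extreme in the smaller tail.
    tail = sum(comb(n, i) for i in range(0, min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    print(sign_test_pvalue([1, 2, 3, 4, 5, 6, 7, 8, 9, -1]))
```

The level of this test is controlled for every continuous $F$ with median 0, in contrast to the situation for the mean described in Theorem 11.4.6.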


11.5 Problems 

Section 11.1 

Problem 11.1 For each $\theta \in \Omega$, let $f_n(\theta)$ be a real-valued sequence. We say $f_n(\theta)$
converges uniformly (in $\theta$) to $f(\theta)$ if

$$\sup_{\theta \in \Omega} |f_n(\theta) - f(\theta)| \to 0$$

as $n \to \infty$. If $\Omega$ is a finite set, show that the pointwise convergence $f_n(\theta) \to f(\theta)$
for each fixed $\theta$ implies uniform convergence. However, show that this implication
can fail even if $\Omega$ is countable.


Section 11.2 

Problem 11.2 For a univariate c.d.f. F, show that the set of points of 
discontinuity is countable. 

Problem 11.3 Let $X$ be $N(0, 1)$ and $Y = X$. Determine the set of continuity
points of the bivariate distribution of $(X, Y)$.


Problem 11.4 Show that $x = (x_1, \ldots, x_k)^T$ is a continuity point of the
distribution $F_X$ of $X$ if the boundary of the set of $(y_1, \ldots, y_k)$ such that $y_i \le x_i$ for all
$i$ has probability 0 under the distribution of $X$. Show by example that it is not
sufficient for $x$ to have probability 0 under $F_X$ in order for $x$ to be a continuity
point.

Problem 11.5 Prove the equivalence of (i) and (vi) in the Portmanteau
Theorem (Theorem 11.2.1).


Problem 11.6 Suppose $X_n \xrightarrow{d} X$. Show that $Ef(X_n)$ need not converge to
$Ef(X)$ if $f$ is unbounded and continuous, or if $f$ is bounded but discontinuous.

Problem 11.7 Show that the characteristic function of a sum of independent
real-valued random variables is the product of the individual characteristic
functions. (The converse is false; counterexamples are given in Romano and Siegel
(1986), Examples 4.29-4.30.)

Problem 11.8 Verify (11.9). 

Problem 11.9 Let $X_n$ have characteristic function $\zeta_n$. Find a counterexample
to show that it is not enough to assume $\zeta_n(t)$ converges (pointwise in $t$) to a
function $\zeta(t)$ in order to conclude that $X_n$ converges in distribution.

Problem 11.10 Show that Theorem 11.2.3 follows from Theorem 11.2.2. 

Problem 11.11 Show that Lyapounov’s Central Limit Theorem (Corollary 
11.2.1) follows from the Lindeberg Central Limit Theorem (Theorem 11.2.5). 

Problem 11.12 Suppose $X_k$ is a noncentral chi-squared variable with $k$
degrees of freedom and noncentrality parameter $\delta^2$. Show that $(X_k - k)/(2k)^{1/2} \xrightarrow{d}
N(\mu, 1)$ if $\delta^2/(2k)^{1/2} \to \mu$ as $k \to \infty$.

Problem 11.13 Suppose $X_{n,1}, \ldots, X_{n,n}$ are i.i.d. Bernoulli trials with success
probability $p_n$. If $p_n \to p \in (0, 1)$, show that

$$n^{1/2} [\bar X_n - p_n] \xrightarrow{d} N(0, p(1-p)) .$$

Is the result true even if $p$ is 0 or 1?

Problem 11.14 Let $X_1, \ldots, X_n$ be i.i.d. with density $p_0$ or $p_1$, and consider
testing the null hypothesis $H$ that $p_0$ is true. The MP level-$\alpha$ test rejects when
$\prod_{i=1}^n r(X_i) \ge C_n$, where $r(X_i) = p_1(X_i)/p_0(X_i)$, or equivalently when

$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \left\{ \log r(X_i) - E_0[\log r(X_i)] \right\} \ge k_n . \qquad (11.89)$$

(i) Show that, under $H$, the left side of (11.89) converges in distribution to
$N(0, \sigma^2)$ with $\sigma^2 = \mathrm{Var}_0[\log r(X_i)]$, provided $\sigma < \infty$.

(ii) From (i) it follows that $k_n \to \sigma z_{1-\alpha}$, where $z_\alpha$ is the $\alpha$ quantile of $N(0, 1)$.

(iii) The power of the test (11.89) against $p_1$ tends to 1 as $n \to \infty$. Hint: Use
Problem 3.39(iv).


Problem 11.15 Complete the proof of Theorem 11.2.8 by considering $n$ even.

Problem 11.16 Generalize Theorem 11.2.8 to the case of the $p$th sample
quantile.

Problem 11.17 Let $X_1, \ldots, X_n$ be i.i.d. normal with mean $\theta$ and variance 1.
Let $\bar X_n$ be the usual sample mean and let $\tilde X_n$ be the sample median. Let $p_n$ be
the probability that $\bar X_n$ is closer to $\theta$ than $\tilde X_n$ is. Determine $\lim_{n \to \infty} p_n$.
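Before attempting Problem 11.17 analytically, it can help to estimate $p_n$ by simulation for a moderate $n$. The sketch below is only a Monte Carlo check with hypothetical names and arbitrary choices of $n$, replication count and seed; it does not derive the limit.

```python
import random
import statistics


def estimate_p_n(n: int, reps: int = 2000, theta: float = 0.0, seed: int = 1) -> float:
    """Monte Carlo estimate of P(|mean - theta| < |median - theta|) for N(theta, 1) data."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(reps):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        if abs(statistics.fmean(xs) - theta) < abs(statistics.median(xs) - theta):
            wins += 1
    return wins / reps


if __name__ == "__main__":
    for n in (11, 101, 501):
        print(n, estimate_p_n(n))
```

The estimates should settle near a constant strictly between 1/2 and 1, reflecting the fact that the mean is the more efficient estimator under normality but does not beat the median in every sample.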

Problem 11.18 Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued random variables with
c.d.f. $F$. Assume $\exists\, \theta_1 < \theta_2$ such that $F(\theta_1) = 1/4$, $F(\theta_2) = 3/4$, and $F$ is
differentiable, with density $f$ taking positive values at $\theta_1$ and $\theta_2$. Show that the
sample inter-quartile range (defined as the difference between the .75 quantile
and .25 quantile) is a $\sqrt{n}$-consistent estimator of the population inter-quartile
range $(\theta_2 - \theta_1)$.

Problem 11.19 Prove Polya’s Theorem 11.2.9. Hint: First consider the case of 
distributions on the real line. 

Problem 11.20 Show that $\rho_L(F, G)$ defined in Definition 11.2.3 is a metric;
that is, show $\rho_L(F, G) = \rho_L(G, F)$, $\rho_L(F, G) = 0$ if and only if $F = G$, and

$$\rho_L(F, G) \le \rho_L(F, H) + \rho_L(H, G) .$$

Problem 11.21 For cumulative distribution functions $F$ and $G$ on the real line,
define the Kolmogorov-Smirnov distance between $F$ and $G$ to be

$$d_K(F, G) = \sup_x |F(x) - G(x)| .$$

Show that $d_K(F, G)$ defines a metric on the space of distribution functions; that
is, show $d_K(F, G) = d_K(G, F)$, $d_K(F, G) = 0$ implies $F = G$, and

$$d_K(F, G) \le d_K(F, H) + d_K(H, G) .$$

Also, show that $\rho_L(F, G) \le d_K(F, G)$, where $\rho_L$ is the Levy metric. Construct a
sequence $F_n$ such that $\rho_L(F_n, F) \to 0$ but $d_K(F_n, F)$ does not converge to zero.

Problem 11.22 Let $F_n$ and $F$ be c.d.f.s on $\mathbb{R}$. Show that weak convergence of
$F_n$ to $F$ is equivalent to $\rho_L(F_n, F) \to 0$, where $\rho_L$ is the Levy metric.

Problem 11.23 Suppose $F$ and $G$ are two probability distributions on $\mathbb{R}^k$. Let
$\mathcal{L}$ be the set of (measurable) functions $f$ from $\mathbb{R}^k$ to $\mathbb{R}$ satisfying $|f(x) - f(y)| \le
|x - y|$, where $|\cdot|$ is the usual Euclidean norm. Define the Bounded-Lipschitz
Metric as

$$\lambda(F, G) = \sup\{ |E_F f(X) - E_G f(X)| : f \in \mathcal{L} \} .$$

Show that $F_n \xrightarrow{d} F$ is equivalent to $\lambda(F_n, F) \to 0$. Thus, weak convergence on
$\mathbb{R}^k$ is metrizable. [See examples 21-22 in Pollard (1984).]

Problem 11.24 Construct a sequence of distribution functions $\{F_n\}$ on the real
line such that $F_n$ converges in distribution to $F$, but the convergence
$F_n^{-1}(1 - \alpha) \to F^{-1}(1 - \alpha)$ fails, even if $F$ is assumed continuous. On the other hand, if $F$
is assumed continuous (but not necessarily strictly increasing), show that

$$F_n(F_n^{-1}(1 - \alpha)) \to F(F^{-1}(1 - \alpha)) = 1 - \alpha .$$

[Note the left side need not be $1 - \alpha$ since $F_n$ is not assumed continuous.]

Problem 11.25 Prove part (ii) of Lemma 11.2.1. 

Problem 11.26 (Markov's Inequality) Let $X$ be a real-valued random variable
with $X \ge 0$. Show that, for any $t > 0$,

$$P\{X \ge t\} \le \frac{E[X I\{X \ge t\}]}{t} \le \frac{E(X)}{t} ;$$

here $I\{X \ge t\}$ is the indicator variable that is 1 if $X \ge t$ and is 0 otherwise.
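Markov's inequality is often stated as the chain $P\{X \ge t\} \le E[X I\{X \ge t\}]/t \le E(X)/t$. The sketch below checks that chain exactly on a small discrete distribution, using rational arithmetic to avoid rounding. The distribution (uniform on $\{0, 1, 2, 3, 4\}$) and the function name are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical example: X uniform on {0, 1, 2, 3, 4}.
VALUES = [0, 1, 2, 3, 4]
PROBS = [Fraction(1, 5)] * 5


def markov_terms(t):
    """Return (P{X >= t}, E[X I{X >= t}] / t, E[X] / t) for the example distribution."""
    tail = sum((p for x, p in zip(VALUES, PROBS) if x >= t), Fraction(0))
    truncated = sum((x * p for x, p in zip(VALUES, PROBS) if x >= t), Fraction(0))
    mean = sum((x * p for x, p in zip(VALUES, PROBS)), Fraction(0))
    return tail, truncated / t, mean / t


if __name__ == "__main__":
    for t in (Fraction(1, 2), 1, 2, 3, 4, 5):
        print(t, markov_terms(t))
```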

Problem 11.27 (Chebyshev's Inequality) (i) Show that, for any real-valued
random variable $X$ and any constants $a > 0$ and $c$,

$$E(X - c)^2 \ge a^2 P\{|X - c| \ge a\} .$$

(ii) Hence, if $X_n$ is any sequence of random variables and $c$ is a constant such
that $E(X_n - c)^2 \to 0$, then $X_n \to c$ in probability. Give a counterexample to
show the converse is false.


Problem 11.28 Give an example of an i.i.d. sequence of real-valued random 
variables such that the sample mean converges in probability to a finite constant, 
yet the mean of the sequence does not exist. 

Problem 11.29 If $X_n \xrightarrow{P} 0$ and

$$\sup_n E[|X_n|^{1+\delta}] < \infty \quad \text{for some } \delta > 0 , \qquad (11.90)$$

then show $E[|X_n|] \to 0$. More generally, if the $X_n$ are uniformly integrable in the
sense $\sup_n E[|X_n| I\{|X_n| > t\}] \to 0$ as $t \to \infty$, then (11.90) holds. [A converse is
given in Dudley (1989), p.279.]

Problem 11.30 Suppose $X_n$ and $X$ are real-valued random variables (defined
on a common probability space). Prove that, if $X_n$ converges to $X$ in probability,
then $X_n$ converges in distribution to $X$. Show by counterexample that the
converse is false. However, show that if $X$ is a constant with probability one, then
$X_n$ converging to $X$ in distribution implies $X_n$ converges to $X$ in probability.

Problem 11.31 Suppose $X_n$ is a sequence of random vectors.

(i) Show $X_n \xrightarrow{P} 0$ if and only if $|X_n| \xrightarrow{P} 0$ (where the first zero refers to the zero
vector and the second to the real number zero).

(ii) Show that convergence in probability of $X_n$ to $X$ is equivalent to convergence
in probability of their components to the respective components of $X$.

Problem 11.32 Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued random variables.
Write $X_i = X_i^+ - X_i^-$, where $X_i^+ = \max(X_i, 0)$. Suppose $X_i^-$ has a finite mean,
but $X_i^+$ does not. Let $\bar X_n$ be the sample mean. Show $\bar X_n \xrightarrow{P} \infty$. Hint: For $B > 0$,
let $Y_i = X_i$ if $X_i \le B$ and $Y_i = B$ otherwise; apply the Weak Law to $\bar Y_n$.

Problem 11.33 (i) Let $K(P_0, P_1)$ be the Kullback-Leibler Information, defined
in (11.21). Show that $K(P_0, P_1) \ge 0$ with equality iff $P_0 = P_1$.

(ii) Show the convergence (11.20) holds even when $K(P_0, P_1) = \infty$. Hint: Use
Problem 11.32.

Problem 11.34 As in Example 11.2.4, consider the problem of testing $P = P_0$
versus $P = P_1$ based on $n$ i.i.d. observations. The problem is an alternative way
to show that a most powerful level $\alpha$ ($0 < \alpha < 1$) test sequence has limiting
power one. If $P_0$ and $P_1$ are distinct, there exists $E$ such that $P_0(E) \ne P_1(E)$.
Let $\hat p_n$ denote the proportion of observations in $E$ and construct a level $\alpha$ test
sequence based on $\hat p_n$ which has power tending to one.

Problem 11.35 If $X_n$ is a sequence of real-valued random variables, prove that
$X_n \to 0$ in $P_n$-probability if and only if $E_{P_n}[\min(|X_n|, 1)] \to 0$.

Problem 11.36 (i) Prove Corollary 11.2.3.

(ii) Suppose $X_n \xrightarrow{d} X$ and $C_n \xrightarrow{P} \infty$. Show $P\{X_n \le C_n\} \to 1$.

Problem 11.37 In Example 11.2.5, show that $\beta_n(p_n) \to 1$ if $n^{1/2}(p_n - 1/2) \to
\infty$ and $\beta_n(p_n) \to \alpha$ if $n^{1/2}(p_n - 1/2) \to 0$.

Problem 11.38 In Example 11.2.7, let $I_n$ be the interval (11.23). Show that,
for any $n$,

$$\inf_p P_p\{p \in I_n\} = 0 .$$

Hint: Consider $p$ positive but small enough so that the chance that a sample of
size $n$ results in 0 successes is nearly 1.
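The hint to Problem 11.38 can be illustrated numerically. The sketch below assumes, as an illustration only, that the interval (11.23) is the usual normal-approximation interval $\hat p_n \pm z_{1-\alpha/2} [\hat p_n (1 - \hat p_n)/n]^{1/2}$, which collapses to the single point 0 when no successes are observed; equation (11.23) itself is not reproduced in this section, so treat that form and the function name as assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_coverage(n: int, p: float, alpha: float = 0.05) -> float:
    """Exact coverage probability of the assumed normal-approximation interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total = 0.0
    for k in range(n + 1):
        phat = k / n
        half = z * sqrt(phat * (1 - phat) / n)
        if phat - half <= p <= phat + half:
            # Binomial(n, p) probability of observing k successes.
            total += comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total


if __name__ == "__main__":
    n = 20
    for p in (0.1, 0.01, 0.001, 0.0001):
        print(p, wald_coverage(n, p))
```

For fixed $n$, the coverage is bounded above by $1 - (1 - p)^n$ whenever $p > 0$, because the interval misses $p$ on the event of 0 successes; this bound tends to 0 as $p \downarrow 0$.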

Problem 11.39 Show how the interval (11.25) is obtained from (11.24). 

Problem 11.40 Show that tightness of a sequence of random vectors in $\mathbb{R}^k$ is
equivalent to each of the component variables being tight in $\mathbb{R}$.

Problem 11.41 Suppose $P_n$ is a sequence of probabilities and $X_n$ is a
sequence of real-valued random variables; the distribution of $X_n$ under $P_n$ is
denoted $\mathcal{L}(X_n | P_n)$. Prove that $\mathcal{L}(X_n | P_n)$ is tight if and only if $X_n / a_n \to 0$
in $P_n$-probability for every sequence $a_n \uparrow \infty$.

Problem 11.42 Suppose $X_n \xrightarrow{d} N(\mu, \sigma^2)$. (i) Show that, for any sequence of
numbers $c_n$, $P(X_n = c_n) \to 0$. (ii) If $c_n$ is any sequence such that $P(X_n > c_n) \to
\alpha$, then $c_n \to \mu + \sigma z_{1-\alpha}$, where $z_{1-\alpha}$ is the $1 - \alpha$ quantile of $N(0, 1)$.

Problem 11.43 Let $X_1, \ldots, X_n$ be i.i.d. normal with mean $\theta$ and variance 1.
Suppose $\hat\theta_n$ is a location equivariant sequence of estimators such that, for every
fixed $\theta$, $n^{1/2}(\hat\theta_n - \theta)$ converges in distribution to the standard normal distribution
(if $\theta$ is true). Let $\bar X_n$ be the usual sample mean. Show that, if $\theta$ is fixed at the
true value, then $n^{1/2}(\hat\theta_n - \bar X_n)$ tends to 0 in probability under $\theta$.


Problem 11.44 Prove part (ii) of Theorem 11.2.14. 


Problem 11.45 Suppose $R$ is a real-valued function on $\mathbb{R}^k$ with $R(y) = o(|y|^p)$
as $|y| \to 0$, for some $p > 0$. If $Y_n$ is a sequence of random vectors satisfying
$|Y_n| = o_P(1)$, then show $R(Y_n) = o_P(|Y_n|^p)$. Hint: Let $g(y) = R(y)/|y|^p$ with
$g(0) = 0$ so that $g$ is continuous at 0; apply the Continuous Mapping Theorem.

Problem 11.46 Use Problem 11.45 to prove (11.28). 


Problem 11.47 Assume $(U_i, V_i)$ is bivariate normal with correlation $\rho$. Let $\hat\rho_n$
denote the sample correlation given by (11.29). Verify the limit result (11.31).

Problem 11.48 (i) If $X_1, \ldots, X_n$ is a sample from a Poisson distribution with
mean $E(X_i) = \lambda$, then $\sqrt{n}(\sqrt{\bar X} - \sqrt{\lambda})$ tends in law to $N(0, \frac{1}{4})$ as $n \to \infty$.

(ii) If $X$ has the binomial distribution $b(p, n)$, then $\sqrt{n}[\arcsin \sqrt{X/n} - \arcsin \sqrt{p}]$
tends in law to $N(0, \frac{1}{4})$ as $n \to \infty$.
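The variance-stabilizing effect of the square root in part (i) can be checked by simulation. The sketch below is illustrative only: the function names are hypothetical, the Poisson sampler is Knuth's multiplication method (adequate for small means), and the sample size, replication count and seed are arbitrary.

```python
import math
import random
import statistics


def poisson_draw(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method for a Poisson(lam) draw; fine for small lam."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def stabilized_variance(lam: float, n: int, reps: int = 2000, seed: int = 2) -> float:
    """Simulated variance of sqrt(n) * (sqrt(Xbar) - sqrt(lam))."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        xbar = sum(poisson_draw(lam, rng) for _ in range(n)) / n
        vals.append(math.sqrt(n) * (math.sqrt(xbar) - math.sqrt(lam)))
    return statistics.pvariance(vals)


if __name__ == "__main__":
    for lam in (0.5, 2.0, 5.0):
        print(lam, stabilized_variance(lam, n=100))
```

The simulated variances should all be close to 1/4 regardless of $\lambda$, whereas the variance of $\sqrt{n}(\bar X - \lambda)$ itself equals $\lambda$.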

Note. Certain refinements of variance stabilizing transformations are discussed by
Anscombe (1948), Freeman and Tukey (1950), and Hotelling (1953).
Transformations of data to achieve approximately a normal linear model are considered by
Box and Cox (1964); for later developments stemming from this work see Bickel
and Doksum (1981), Box and Cox (1982), and Hinkley and Runger (1984).

Problem 11.49 Suppose $X_{i,j}$ are independently distributed as $N(\mu_i, \sigma_i^2)$; $i =
1, \ldots, s$; $j = 1, \ldots, n_i$. Let $S_{n,i}^2 = \sum_j (X_{i,j} - \bar X_i)^2$, where $\bar X_i = n_i^{-1} \sum_j X_{i,j}$. Let
$Z_{n,i} = \log[S_{n,i}^2/(n_i - 1)]$. Show that, as $n_i \to \infty$,

$$\sqrt{n_i - 1}\, [Z_{n,i} - \log(\sigma_i^2)] \xrightarrow{d} N(0, 2) .$$

Thus, for large $n_i$, the problem of testing equality of all the $\sigma_i$ can be
approximately viewed as testing equality of means of normally distributed variables with
known (possibly different) variances. Use Problem 7.12 to suggest a test.


Problem 11.50 Let $X_1, \ldots, X_n$ be i.i.d. Poisson with mean $\lambda$. Consider
estimating $g(\lambda) = e^{-\lambda}$ by the estimator $T_n = e^{-\bar X_n}$. Find an approximation to the
bias of $T_n$; specifically, find a function $b(\lambda)$ satisfying

$$E_\lambda(T_n) = g(\lambda) + n^{-1} b(\lambda) + O(n^{-2})$$

as $n \to \infty$. Such an expression suggests a new estimator $T_n - n^{-1} b(\lambda)$, which has
bias $O(n^{-2})$. But $b(\lambda)$ is unknown. Show that the estimator $T_n - n^{-1} b(\bar X_n)$ has
bias $O(n^{-2})$.

Problem 11.51 Let $X_1, \ldots, X_n$ be a random sample from the Poisson
distribution with unknown mean $\lambda$. The uniformly minimum variance unbiased estimator
(UMVUE) of $\exp(-\lambda)$ is known to be $[(n - 1)/n]^{T_n}$, where $T_n = \sum_{i=1}^n X_i$. Find
the asymptotic distribution of the UMVUE (appropriately normalized). Hint: It
may be easier to first find the asymptotic distribution of $\exp(-T_n/n)$.

Problem 11.52 Let $X_{i,j}$, $1 \le i \le I$, $1 \le j \le n$ be independent with $X_{i,j}$
Poisson with mean $\lambda_i$. The problem is to test the null hypothesis that the $\lambda_i$ are
all the same versus they are not all the same. Consider the test that rejects the
null hypothesis iff

$$T = \frac{n \sum_{i=1}^{I} (\bar X_i - \bar X)^2}{\bar X}$$

is large, where $\bar X_i = \sum_j X_{i,j}/n$ and $\bar X = \sum_i \bar X_i / I$.

(i) How large should the critical values be so that, if the null hypothesis is correct,
the probability of rejecting the null hypothesis tends (as $n \to \infty$ with $I$ fixed) to
the nominal level $\alpha$?

(ii) Show that the test is pointwise consistent in power against any $(\lambda_1, \ldots, \lambda_I)$,
as long as the $\lambda_i$ are not all equal.

Problem 11.53 Prove the Glivenko-Cantelli Theorem. Hint: Use the Strong 
Law of Large Numbers and the monotonicity of F. 

Problem 11.54 Let $X_1, \ldots, X_n$ be i.i.d. $P$ on $S$. Suppose $S$ is countable and
let $\mathcal{E}$ be the collection of all subsets of $S$. Let $\hat P_n$ be the empirical measure, that
is, for any subset $E$ in $\mathcal{E}$, $\hat P_n(E)$ is the proportion of observations $X_i$ that fall in
$E$. Prove, with probability one,

$$\sup_{E \in \mathcal{E}} |\hat P_n(E) - P(E)| \to 0 .$$

Problem 11.55 Suppose $X_n$ is a tight sequence and $Y_n \xrightarrow{P} 0$. Show that
$X_n Y_n \xrightarrow{P} 0$. If it is assumed $Y_n \to 0$ almost surely, can you conclude $X_n Y_n \to 0$
almost surely?

Problem 11.56 For a c.d.f. $F$, define the quantile transformation $Q$ by

$$Q(u) = \inf\{t : F(t) \ge u\} .$$

(i) Show the event $\{F(t) \ge u\}$ is the same as $\{Q(u) \le t\}$.

(ii) If $U$ is uniformly distributed on $(0, 1)$, show the distribution of $Q(U)$ is $F$.
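To make the quantile transformation concrete, the sketch below implements $Q(u) = \inf\{t : F(t) \ge u\}$ for a small discrete distribution and checks the equivalence in part (i) on a grid. The three-point distribution and the function names are hypothetical choices for illustration; exact rational arithmetic avoids rounding at the jump points.

```python
from fractions import Fraction

# Hypothetical example distribution: atoms 0, 1, 3 with masses 1/4, 1/2, 1/4.
ATOMS = [0, 1, 3]
PROBS = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]


def cdf(t):
    """F(t) = P{X <= t} for the example distribution."""
    return sum((p for a, p in zip(ATOMS, PROBS) if a <= t), Fraction(0))


def quantile(u):
    """Q(u) = inf{t : F(t) >= u}; for a discrete F this is the smallest such atom."""
    for a in ATOMS:
        if cdf(a) >= u:
            return a
    raise ValueError("u must lie in (0, 1]")


if __name__ == "__main__":
    ts = [-1, 0, Fraction(1, 2), 1, 2, 3, 4]
    us = [Fraction(k, 20) for k in range(1, 21)]
    # Part (i): {F(t) >= u} and {Q(u) <= t} coincide on the whole grid.
    print(all((cdf(t) >= u) == (quantile(u) <= t) for t in ts for u in us))
```

Part (ii) then follows from (i): $P\{Q(U) \le t\} = P\{U \le F(t)\} = F(t)$, which is also the basis of inverse-transform sampling.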

Problem 11.57 Let $U_1, \ldots, U_n$ be i.i.d. with c.d.f. $G(u) = u$ and let $\hat G_n$ denote
the empirical c.d.f. of $U_1, \ldots, U_n$. Define

$$B_n(u) = n^{1/2} [\hat G_n(u) - u] .$$

(Note that $B_n(\cdot)$ is a random function, called the uniform empirical process.)

(i) Show that the distribution of the Kolmogorov-Smirnov test statistic
$n^{1/2} d_K(\hat G_n, G)$ under $G$ is that of $\sup_u |B_n(u)|$.

(ii) Suppose $X_1, \ldots, X_n$ are i.i.d. $F$ (not necessarily continuous), and let
$\hat F_n$ denote the empirical c.d.f. of $X_1, \ldots, X_n$. Show that the distribution
of the Kolmogorov-Smirnov test statistic $n^{1/2} d_K(\hat F_n, F)$ under $F$ is that of
$\sup_t |B_n(F(t))|$, where $B_n$ is defined in (i). Deduce that this distribution does
not depend on $F$ when $F$ is continuous.
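The supremum of the uniform empirical process can be computed exactly from the order statistics, since $\hat G_n(u) - u$ is piecewise linear between jumps. The sketch below is a minimal illustration with hypothetical function names; it evaluates $\sup_u |\hat G_n(u) - u|$ and the scaled statistic $\sup_u |B_n(u)|$ for a given sample from $(0, 1)$.

```python
import math


def sup_abs_uniform_empirical(us) -> float:
    """sup_u |G_n(u) - u| for a sample us from (0, 1).

    Just after the i-th order statistic u_(i) the gap is i/n - u_(i);
    just before it the gap is u_(i) - (i - 1)/n.
    """
    xs = sorted(us)
    n = len(xs)
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, start=1))


def ks_statistic_uniform(us) -> float:
    """sup_u |B_n(u)| = n^{1/2} sup_u |G_n(u) - u|."""
    return math.sqrt(len(us)) * sup_abs_uniform_empirical(us)


if __name__ == "__main__":
    print(sup_abs_uniform_empirical([0.7, 0.1, 0.4]))
    print(ks_statistic_uniform([0.7, 0.1, 0.4]))
```

For a continuous $F$, applying the same function to the transformed values $F(X_1), \ldots, F(X_n)$ gives the Kolmogorov-Smirnov statistic for the $X_i$, which is one way to see the distribution-free property in part (ii).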

Problem 11.58 Consider the uniform confidence band $R_{n,1-\alpha}$ for $F$ given by
(11.36). Let $\mathbf{F}$ be the set of all distributions on $\mathbb{R}$. Show,

$$\inf_{F \in \mathbf{F}} P_F\{F \in R_{n,1-\alpha}\} \ge 1 - \alpha .$$

Problem 11.59 Show how Theorem 11.2.18 implies Theorem 11.2.17. Hint: Use
the Borel-Cantelli Lemma; see Billingsley (1995, Theorem 4.3).

Problem 11.60 (i) If $X_1, \ldots, X_n$ are i.i.d. with c.d.f. $F$ and empirical
distribution $\hat F_n$, use Theorem 11.2.18 to show that $n^{1/2} \sup_t |\hat F_n(t) - F(t)|$ is a tight
sequence.

(ii) Let $F_n$ be any sequence of distributions, and let $\hat F_n$ be the empirical
distribution based on a sample of size $n$ from $F_n$. Show that $n^{1/2} \sup_t |\hat F_n(t) - F_n(t)|$
is a tight sequence.

Problem 11.61 Show that $X_n \to X$ in probability is equivalent to the
statement that, for any subsequence $X_{n_j}$, there exists a further subsequence $X_{n_{j_k}}$
such that $X_{n_{j_k}} \to X$ with probability one.


Section 11.3 

Problem 11.62 (i) Let $X_1, \ldots, X_n$ be a sample from $N(\xi, \sigma^2)$. For testing $\xi = 0$
against $\xi > 0$, show that the power of the one-sided one-sample t-test against a
sequence of alternatives $N(\xi_n, \sigma^2)$ for which $n^{1/2} \xi_n / \sigma \to \delta$ tends to
$1 - \Phi(z_{1-\alpha} - \delta)$.

(ii) The result of (i) remains valid if $X_1, \ldots, X_n$ are a sample from any distribution
with mean $\xi$ and finite variance $\sigma^2$.

Problem 11.63 Generalize the previous problem to the two-sample t-test.

Problem 11.64 Let $(Y_i, Z_i)$ be i.i.d. bivariate random vectors in the plane, with
both $Y_i$ and $Z_i$ assumed to have finite nonzero variances. Let $\mu_Y = E(Y_1)$ and
$\mu_Z = E(Z_1)$, let $\rho$ denote the correlation between $Y_1$ and $Z_1$, and let $\hat\rho_n$ denote
the sample correlation, as defined in (11.29).

(i) Under the assumption $\rho = 0$, show directly (without appealing to Example
11.2.10) that $n^{1/2} \hat\rho_n$ is asymptotically normal with mean 0 and variance

$$\tau^2 = \mathrm{Var}[(Y_1 - \mu_Y)(Z_1 - \mu_Z)] / [\mathrm{Var}(Y_1) \mathrm{Var}(Z_1)] .$$

(ii) For testing that $Y_1$ and $Z_1$ are independent, consider the test that rejects
when $n^{1/2} |\hat\rho_n| > z_{1-\alpha/2}$. Show that the asymptotic rejection probability is $\alpha$,
without assuming normality, but under the sole assumption that $Y_1$ and $Z_1$ have
arbitrary distributions with finite nonzero variances.

(iii) However, for testing $\rho = 0$, the above test is not asymptotically robust.
Show that there exist bivariate distributions for $(Y_1, Z_1)$ for which $\rho = 0$ but the
limiting variance $\tau^2$ can take on any given positive value.

(iv) For testing $\rho = 0$ against $\rho > 0$, define a denominator $D_n$ and a critical
value $c_n$ such that the rejection region $n^{1/2} \hat\rho_n / D_n \ge c_n$ has probability tending
to $\alpha$, under any bivariate distribution with $\rho = 0$ and finite, nonzero marginal
variances.

Problem 11.65 Under the assumptions of Lemma 11.3.1, compute $\mathrm{Cov}(X_i^2, X_j^2)$
in terms of $\rho_{i,j}$ and $\sigma^2$. Show that $\mathrm{Var}(n^{-1} \sum_{i=1}^n X_i^2) \to 0$ and hence

$$n^{-1} \sum_{i=1}^n X_i^2 \xrightarrow{P} \sigma^2 .$$

Problem 11.66 (i) Given $\rho$, find the smallest and largest value of (11.42) as
$\sigma^2/\tau^2$ varies from 0 to $\infty$.

(ii) For nominal level $\alpha = .05$ and $\rho = .1, .2, .3, .4$, determine the smallest and
the largest asymptotic level of the t-test as $\sigma^2/\tau^2$ varies from 0 to $\infty$.

Problem 11.67 Verify the formula for Var(X) in Model A. 

Problem 11.68 In Model A, suppose that the number of observations in group
$i$ is $n_i$. If $n_i \le M$ and $s \to \infty$, show that the assumptions of Lemma 11.3.1 are
satisfied and determine $\gamma$.

Problem 11.69 Show that the conditions of Lemma 11.3.1 are satisfied and $\gamma$
has the stated value: (i) in Model B; (ii) in Model C.

Problem 11.70 Determine the maximum asymptotic level of the one-sided
t-test when $\alpha = .05$ and $m = 2, 4, 6$: (i) in Model A; (ii) in Model B.

Problem 11.71 Prove (i) of Lemma 11.3.2.

Problem 11.72 Prove Lemma 11.3.3. Hint: For part (ii), use Problem 11.61.

Problem 11.73 Verify the claims made in Example 11.3.1.

Problem 11.74 Verify (11.52). 

Problem 11.75 In Example 11.3.3, verify the Huber Condition holds. 

Problem 11.76 Let $X_{ijk}$ ($k = 1, \ldots, n_{ij}$; $i = 1, \ldots, a$; $j = 1, \ldots, b$) be
independently normally distributed with mean $E(X_{ijk}) = \xi_{ij}$ and variance $\sigma^2$. Then
the test of any linear hypothesis concerning the $\xi_{ij}$ has a robust level provided
$n_{ij} \to \infty$ for all $i$ and $j$.

Problem 11.77 In the two-way layout of the preceding problem give examples
of submodels $\Pi_\Omega^{(1)}$ and $\Pi_\Omega^{(2)}$ of dimensions $s_1$ and $s_2$, both less than $ab$, such that
in one case the condition (11.57) continues to require $n_{ij} \to \infty$ for all $i$ and $j$ but
becomes a weaker requirement in the other case.

Problem 11.78 Suppose (11.57) holds for some particular sequence $\Pi_\Omega^{(n)}$ with
fixed $s$. Then it holds for any sequence $\Pi_{\Omega'}^{(n)} \subset \Pi_\Omega^{(n)}$ of dimension $s' < s$.

Hint: If $\Pi_\Omega$ is spanned by the $s$ columns of $A$, let $\Pi_{\Omega'}$ be spanned by the first $s'$
columns of $A$.

Problem 11.79 Show that (11.48) holds whenever $c_n$ tends to a finite nonzero
limit, but the condition need not hold if $c_n \to 0$.

Problem 11.80 Let $\{c_n\}$ and $\{c'_n\}$ be two increasing sequences of constants
such that $c'_n / c_n \to 1$ as $n \to \infty$. Then $\{c_n\}$ satisfies (11.48) if and only if $\{c'_n\}$
does.

Problem 11.81 Let $c_n = u_0 + u_1 n + \cdots + u_k n^k$, $u_i \ge 0$ for all $i$. Then $c_n$ satisfies
(11.48). What if $c_n = 2^n$? Hint: Apply Problem 11.80 with $c'_n = n^k$.

Problem 11.82 If $\xi_i = \alpha + \beta t_i + \gamma u_i$, express the condition (11.57) in terms of
the $t$'s and $u$'s.

Problem 11.83 If $\Pi_{i,i}$ are defined as in (11.56), show that $\sum_{i=1}^n \Pi_{i,i} = s$.

Hint: Since the $\Pi_{i,i}$ are independent of $A$, take $A$ to be orthogonal.

Problem 11.84 The size of each of the following tests is robust against 
nonnormality: 

(i) the test (7.24) as b —> oo, 

(ii) the test (7.26) as mb —> oo, 

(iii) the test (7.28) as m —» oo. 

Problem 11.85 For $i = 1, \ldots, s$ and $j = 1, \ldots, n_i$, let $X_{i,j}$ be independent, with
$X_{i,j}$ having distribution $F_i$, where $F_i$ is an arbitrary distribution with mean $\mu_i$
and finite common variance $\sigma^2$. Consider testing $\mu_1 = \cdots = \mu_s$ based on the test
statistic (11.66), which is UMPI under normality. Show the test remains robust
with respect to the rejection probability under $H_0$ even if the $F_i$ differ and are
not normal.

Problem 11.86 In the preceding problem, investigate the rejection probability
when the $F_i$ have different variances. Assume $\min n_i \to \infty$ and $n_i / n \to \rho_i$.

Problem 11.87 Show that the test derived in Problem 11.49 is not robust
against nonnormality.

Problem 11.88 Let $X_1, \ldots, X_n$ be a sample from $N(\xi, \sigma^2)$, and consider the
UMP invariant level-$\alpha$ test of $H : \xi/\sigma \le \theta_0$ (Section 6.4). Let $\alpha_n(F)$ be the actual
significance level of this test when $X_1, \ldots, X_n$ is a sample from a distribution $F$
with $E(X_i) = \xi$, $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then the relation $\alpha_n(F) \to \alpha$ will not
in general hold unless $\theta_0 = 0$. Hint: First find the limiting joint distribution of
$\sqrt{n}(\bar X - \xi)$ and $\sqrt{n}(S^2 - \sigma^2)$.


Section 11.4

Problem 11.89 When sampling from a normal distribution, one can derive an
Edgeworth expansion for the t-statistic as follows. Suppose $X_1, \ldots, X_n$ are i.i.d.
$N(\mu, \sigma^2)$ and let $t_n = n^{1/2}(\bar X_n - \mu)/S_n$, where $S_n^2$ is the usual unbiased estimate
of $\sigma^2$. Let $\Phi$ be the standard normal c.d.f. and let $\Phi' = \varphi$. Show

$$P\{t_n \le t\} = \Phi(t) - \frac{1}{4n}(t + t^3)\varphi(t) + O(n^{-2}) \qquad (11.91)$$

as follows. It suffices to let $\mu = 0$ and $\sigma = 1$. By conditioning on $S_n$, we can write

$$P\{t_n \le t\} = E\{\Phi[t(1 + S_n^2 - 1)^{1/2}]\} .$$

By Taylor expansion inside the expectation, along with moments of $S_n^2$, one can
deduce (11.91).

Problem 11.90 Assuming F is absolutely continuous with 4 moments, verify 
(11.76). 


Problem 11.91 Let $\phi_n$ be the classical t-test for testing the mean is zero versus
the mean is positive, based on $n$ i.i.d. observations from $F$. Consider the power of
this test against the distribution $N(\mu, 1)$. Show the power tends to one as $\mu \to \infty$.


Problem 11.92 Suppose $\mathbf{F}$ satisfies the conditions of Theorem 11.4.6. Assume
there exists $\phi_n$ such that

$$\sup_{F \in \mathbf{F} :\, \mu(F) = 0} E_F(\phi_n) \to \alpha .$$

Show that

$$\limsup_n E_F(\phi_n) \le \alpha$$

for every $F \in \mathbf{F}$.


Problem 11.93 In the proof of Theorem 11.4.4, prove $S_n / \sigma(F_n) \to 1$ in
probability.

Problem 11.94 Prove Lemma 11.4.5. 


Problem 11.95 Consider the problem of testing $\mu(F) = 0$ versus $\mu(F) \ne 0$, for
$F \in \mathbf{F}_0$, the class of distributions supported on $[0, 1]$. Let $\phi_n$ be Anderson's test.

(i) If

$$|n^{1/2} \mu(F_n)| \ge \delta > 2 s_{n,1-\alpha} ,$$

then show that

$$E_{F_n}(\phi_n) \ge 1 - \frac{1}{2(2 s_{n,1-\alpha} - \delta)^2} ,$$

where $s_{n,1-\alpha}$ is the $1 - \alpha$ quantile of the null distribution of the Kolmogorov-
Smirnov statistic. Hint: Use (11.88) and Chebyshev's inequality.

(ii) Deduce that the minimum power of $\phi_n$ over $\{F : n^{1/2} |\mu(F)| \ge \delta\}$ is at least
$1 - [2(2 s_{n,1-\alpha} - \delta)^2]^{-1}$ if $\delta > 2 s_{n,1-\alpha}$.

(iii) Use (ii) to show that, if $F_n \in \mathbf{F}_0$ is any sequence of distributions satisfying
$n^{1/2} |\mu(F_n)| \to \infty$, then $E_{F_n}(\phi_n) \to 1$.

Problem 11.96 Prove the second equality in (11.81). In the proof of Lemma 
11.4.2, show that n n (n) —» 0. 

Problem 11.97 Let $Y_{n,1}, \ldots, Y_{n,n}$ be i.i.d. Bernoulli variables with success
probability $p_n$, where $n p_n = \lambda$ and $\lambda^{1/2} = \delta$. Let $U_{n,1}, \ldots, U_{n,n}$ be i.i.d. uniform
variables on $(-\tau_n, \tau_n)$, where $\tau_n^2 = 3 p_n^2$. Then, let $X_{n,i} = Y_{n,i} + U_{n,i}$, so that $F_n$ is
the distribution of $X_{n,i}$. (Note that $n^{1/2} \mu(F_n) / \sigma(F_n) = \delta$.)

(i) If $t_n$ is the t-statistic, show that, under $F_n$, $t_n \xrightarrow{d} V^{1/2}$, where $V$ is Poisson
with mean $\delta^2$, and so if $z_{1-\alpha}$ is not an integer,

$$P_{F_n}\{t_n > t_{n-1,1-\alpha}\} \to P\{V^{1/2} > z_{1-\alpha}\} .$$

(ii) Show, for $\alpha < 1/2$, the limiting power of the t-test against $F_n$ satisfies

$$P\{V^{1/2} > z_{1-\alpha}\} \le 1 - P\{V = 0\} = 1 - \exp(-\delta^2) .$$

This is strictly smaller than $1 - \Phi(z_{1-\alpha} - \delta)$ if and only if

$$\Phi(z_{1-\alpha} - \delta) < \exp(-\delta^2) .$$

Certainly, for small $\delta$, this inequality holds, since the left hand side tends to $1 - \alpha$
as $\delta \to 0$ while the right hand side tends to 1.
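The final inequality, $\Phi(z_{1-\alpha} - \delta) < \exp(-\delta^2)$, is easy to explore numerically. The sketch below uses only the Python standard library; the function name is hypothetical and the values of $\alpha$ and $\delta$ are arbitrary illustrations.

```python
import math
from statistics import NormalDist


def bound_below_normal_power(alpha: float, delta: float) -> bool:
    """True when Phi(z_{1-alpha} - delta) < exp(-delta^2).

    Equivalently, 1 - exp(-delta^2), the upper bound on the limiting power of
    the t-test, is strictly below the normal-theory power 1 - Phi(z_{1-alpha} - delta).
    """
    z = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(z - delta) < math.exp(-delta ** 2)


if __name__ == "__main__":
    for delta in (0.1, 0.5, 1.0, 2.0):
        print(delta, bound_below_normal_power(0.05, delta))
```

With $\alpha = .05$ the inequality holds for small $\delta$, as the problem asserts, and fails once $\delta$ is large enough that $\exp(-\delta^2)$ has decayed below $\Phi(z_{1-\alpha} - \delta)$.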


11.6 Notes 

The convergence concepts in Section 11.2 are classical and can be found in most 
graduate probability texts such as Billingsley (1995) or Dudley (1989). The Central
Limit Theory for Bernoulli trials dates back to de Moivre (1733) and for
more general distributions to Laplace (1812). Their treatment was probabilistic 
and did not involve problems in inference. Normal experiments were first treated 
in Gauss (1809). Further history is provided in Stigler (1986) and Hald (1990, 
1998). 

Concern about the robustness of classical normal theory tests began to be 
voiced in the 1920s (Neyman and Pearson (1928), Shewhart and Winters (1928), 
Sophister (1928), and Pearson (1929)) and has been an important topic ever 
since. Particularly influential were Box (1953), where the term robustness was
introduced; also see Scheffe (1959, Chapter 10), Tukey (1960) and Hotelling (1961).
The robustness of regression tests studied in Section 11.3.3 is based on Huber 
(1973). 

As remarked in Example 11.3.4, the F-test for testing equality of means is not 
robust if the underlying variances differ, even if the sample sizes are equal and s > 
2; see Scheffe (1959). More appropriate tests for this generalized Behrens-Fisher 
problem have been proposed by Welch (1951), James (1951), and Brown and 
Forsythe (1974b), and are further discussed by Clinch and Kesselman (1982). The 
corresponding robustness problem for more general linear hypotheses is treated 
by James (1954) and Johansen (1980); see also Rothenberg (1984).

The linear model F-test, as was seen to be the case for the t-test, is highly
nonrobust against dependence of the observations. Tests of the hypothesis that
the covariance matrix is proportional to the identity against various specified
forms of dependence are considered in King and Hillier (1985). For recent work
on robust testing in linear models, see Muller (1998) and the references cited
there. 

The usual test for equality of variances is Bartlett’s test, which is discussed in 
Cyr and Manoukian (1982) and Glaser (1982). Bartlett’s test is highly sensitive
to the assumption of normality, and therefore is rarely appropriate. More robust 
tests for this latter hypothesis are reviewed in Conover, Johnson, and Johnson 
(1981). For testing homogeneity of covariance matrices, see Beran and Srivastava 
(1985) and Zhang and Boos (1992). 

Robustness properties of the t-test are studied in Efron (1969), Lehmann and
Loh (1990), Basu and DasGupta (1995), Basu (1999) and Romano (2004). The
nonexistence results of Bahadur and Savage (1956), and also Hoeffding (1956), 
have been generalized to other problems; see Donoho (1988) and Romano (2004) 
and the references there. 

The idea of expanding the distribution of the sample mean in order to study the
error in normal approximation can be traced to Chebyshev (1890) and Edgeworth
(1905). But rigorous results were not available until Cramer (1928, 1937).
The fundamental theory of Edgeworth expansions is developed in Bhattacharya 
and Rao (1976); also see Bickel (1974), Bhattacharya and Ghosh (1978), Hall 
(1992) and Hall and Jing (1995). 



12 

Quadratic Mean Differentiable 
Families 


12.1 Introduction 

As mentioned at the beginning of Chapter 11, the finite sample theory of 
optimality for hypothesis testing applied only to rather special parametric families, 
primarily exponential families and group families. On the other hand, asymptotic 
optimality will apply more generally to parametric families satisfying smoothness 
conditions. In particular, we shall assume a certain type of differentiability 
condition, called quadratic mean differentiability. Such families will be considered in 
Section 12.2. In Section 12.3, the notion of contiguity will be developed, primarily 
as a technique for calculating the limiting distribution or power of a test statistic 
under an alternative sequence, especially when the limiting distribution under 
the null hypothesis is easy to obtain. In Section 12.4, these techniques will then 
be applied to classes of tests based on the likelihood function, namely the Wald, 
Rao, and likelihood ratio tests. The asymptotic optimality of these tests will be 
established in Chapter 13. 


12.2 Quadratic Mean Differentiability (q.m.d.) 

Consider a parametric model $\{P_\theta, \theta \in \Omega\}$, where, throughout this section, $\Omega$ is 
assumed to be an open subset of $\mathbb{R}^k$. The probability measures $P_\theta$ are defined 
on some measurable space $(\mathcal{X}, \mathcal{C})$. Assume each $P_\theta$ is absolutely continuous with 
respect to a $\sigma$-finite measure $\mu$, and set $p_\theta(x) = dP_\theta(x)/d\mu(x)$. In this section, 
smooth parametric models will be considered. To motivate the smoothness 
condition given in Definition 12.2.1 below, consider the case of $n$ i.i.d. random variables 
$X_1, \ldots, X_n$ and the problem of testing a simple null hypothesis $\theta = \theta_0$ against a 
simple alternative $\theta_1$ (possibly depending on $n$). The most powerful test rejects 
when the loglikelihood ratio statistic 

$$\log[L_n(\theta_1)/L_n(\theta_0)]$$

is sufficiently large, where 

$$L_n(\theta) = \prod_{i=1}^n p_\theta(X_i) \tag{12.1}$$

denotes the likelihood function. We would like to obtain certain expansions of 
the loglikelihood ratio, and the smoothness condition we impose will ensure the 
existence of such an expansion. 

Example 12.2.1 (Normal Location Model) Suppose $P_\theta$ is $N(\theta, \sigma^2)$, where 
$\sigma^2$ is known. It is easily checked that 

$$\log[L_n(\theta_1)/L_n(\theta_0)] = \frac{n}{\sigma^2}\left[(\theta_1 - \theta_0)\bar X_n - \frac{1}{2}(\theta_1^2 - \theta_0^2)\right], \tag{12.2}$$

where $\bar X_n = \sum_{i=1}^n X_i/n$. By the Weak Law of Large Numbers, under $\theta_0$, 

$$(\theta_1 - \theta_0)\bar X_n - \frac{1}{2}(\theta_1^2 - \theta_0^2) \xrightarrow{P} (\theta_1 - \theta_0)\theta_0 - \frac{1}{2}(\theta_1^2 - \theta_0^2) = -\frac{1}{2}(\theta_1 - \theta_0)^2,$$

and so $\log[L_n(\theta_1)/L_n(\theta_0)] \xrightarrow{P} -\infty$. Therefore, $\log[L_n(\theta_1)/L_n(\theta_0)]$ is 
asymptotically unbounded in probability under $\theta_0$. As in Example 11.2.5, a more useful 
result is obtained if $\theta_1$ in (12.2) is replaced by $\theta_0 + hn^{-1/2}$. We then find 

$$\log[L_n(\theta_0 + hn^{-1/2})/L_n(\theta_0)] = \frac{hn^{1/2}(\bar X_n - \theta_0)}{\sigma^2} - \frac{h^2}{2\sigma^2} = hZ_n - \frac{h^2}{2\sigma^2}, \tag{12.3}$$

where $Z_n = n^{1/2}(\bar X_n - \theta_0)/\sigma^2$ is $N(0, 1/\sigma^2)$. Notice that the expansion (12.3) is 
a linear function of $Z_n$ and a simple quadratic function of $h$, with the coefficient 
of $h^2$ nonrandom. Furthermore, $\log[L_n(\theta_0 + hn^{-1/2})/L_n(\theta_0)]$ is distributed as 
$N(-h^2/(2\sigma^2), h^2/\sigma^2)$ under $\theta_0$ for every $n$. (The relationship that the mean is the 
negative of half the variance will play a key role in the next section.) ■ 
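
As a quick numerical check of the identity (12.3), the following minimal sketch computes the exact loglikelihood ratio from the normal density and compares it with $hZ_n - h^2/(2\sigma^2)$. The function names, the seed, and the particular values of $\theta_0$, $h$ and $\sigma^2$ are illustrative assumptions, not part of the text.

```python
import numpy as np

def normal_loglik_ratio(x, theta0, h, sigma2):
    """Exact log[L_n(theta0 + h/sqrt(n)) / L_n(theta0)] for N(theta, sigma2) data."""
    n = x.size
    theta1 = theta0 + h / np.sqrt(n)
    # The normalizing constants cancel; only the quadratic exponents remain.
    return np.sum((x - theta0) ** 2 - (x - theta1) ** 2) / (2.0 * sigma2)

def expansion_12_3(x, theta0, h, sigma2):
    """Right side of (12.3): h * Z_n - h^2 / (2 * sigma2)."""
    n = x.size
    z_n = np.sqrt(n) * (x.mean() - theta0) / sigma2
    return h * z_n - h ** 2 / (2.0 * sigma2)

# Illustrative data: theta0 = 1, sigma2 = 4 (standard deviation 2), n = 50.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=50)
gap = normal_loglik_ratio(x, 1.0, 0.7, 4.0) - expansion_12_3(x, 1.0, 0.7, 4.0)
```

In the normal model the two sides agree up to floating-point rounding for every $n$, which is why (12.3) is an exact identity rather than an approximation.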

The following more general family permits an asymptotic version of (12.3). 

Example 12.2.2 (One-parameter Exponential Family) Let $X_1, \ldots, X_n$ be 
i.i.d. having density 

$$p_\theta(x) = \exp[\theta T(x) - A(\theta)]$$

with respect to a $\sigma$-finite measure $\mu$. Assume $\theta_0$ lies in the interior of the natural 
parameter space. Then, 

$$\log[L_n(\theta_0 + hn^{-1/2})/L_n(\theta_0)] = hn^{-1/2}\sum_{i=1}^n T(X_i) - n[A(\theta_0 + hn^{-1/2}) - A(\theta_0)].$$

Recall (Problem 2.16) that $E_{\theta_0}[T(X_i)] = A'(\theta_0)$ and $\mathrm{Var}_{\theta_0}[T(X_i)] = A''(\theta_0)$. By 
a Taylor expansion, 

$$n[A(\theta_0 + hn^{-1/2}) - A(\theta_0)] = hn^{1/2}A'(\theta_0) + \frac{1}{2}h^2A''(\theta_0) + o(1)$$

as $n \to \infty$, so that 

$$\log[L_n(\theta_0 + hn^{-1/2})/L_n(\theta_0)] = hZ_n - \frac{1}{2}h^2A''(\theta_0) + o(1), \tag{12.4}$$

where, under $\theta_0$, 

$$Z_n = n^{-1/2}\sum_{i=1}^n \{T(X_i) - E_{\theta_0}[T(X_i)]\} \xrightarrow{d} N(0, A''(\theta_0)).$$

Thus, the loglikelihood ratio (12.4) behaves asymptotically like the loglikelihood 
ratio (12.3) from a normal location model. As we will see, such approximations 
allow one to deduce asymptotic optimality properties for the exponential model 
(or any model whose likelihood ratios satisfy an appropriate generalization of 
(12.4)) from optimality properties of the simple normal location model. ■ 
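
To make the $o(1)$ term in (12.4) concrete, here is a hedged sketch for one particular exponential family, chosen for illustration only: the Poisson family in its natural parametrization, with $T(x) = x$ and $A(\theta) = e^\theta$ (the dominating measure puts mass $1/x!$ on each nonnegative integer). The function name and the sample sizes are assumptions.

```python
import numpy as np

def poisson_remainder(x, theta0, h):
    """Exact loglikelihood ratio minus h*Z_n - (1/2) h^2 A''(theta0), Poisson case."""
    n = x.size
    delta = h / np.sqrt(n)
    a0 = np.exp(theta0)  # A(theta0) = A'(theta0) = A''(theta0) = exp(theta0)
    # n * [A(theta0 + delta) - A(theta0)] = n * a0 * expm1(delta)
    exact = delta * x.sum() - n * a0 * np.expm1(delta)
    z_n = (x.sum() - n * a0) / np.sqrt(n)
    approx = h * z_n - 0.5 * h ** 2 * a0
    return exact - approx

rng = np.random.default_rng(0)
small = poisson_remainder(rng.poisson(1.0, size=10_000), 0.0, 1.0)
large = poisson_remainder(rng.poisson(1.0, size=1_000_000), 0.0, 1.0)
```

In this family the remainder does not depend on the data: it is the Taylor remainder of $A$, approximately $-e^{\theta_0}h^3/(6n^{1/2})$, so it shrinks at rate $n^{-1/2}$.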

We would like to obtain an approximate result like (12.4) for more general 
families. Classical smoothness conditions usually assume that, for fixed $x$, the 
function $p_\theta(x)$ is differentiable in $\theta$ at $\theta_0$; that is, for some function $\dot p_\theta(x)$, 

$$p_{\theta_0+h}(x) - p_{\theta_0}(x) - \langle \dot p_{\theta_0}(x), h\rangle = o(|h|)$$

as $|h| \to 0$. In addition, higher order differentiability is typically assumed with 
further assumptions on the remainder terms. In order to avoid such strong 
assumptions, it turns out to be useful to work with square roots of densities. For 
fixed $x$, differentiability of $p_\theta^{1/2}(x)$ at $\theta = \theta_0$ requires the existence of a function 
$\eta(x, \theta_0)$ such that 

$$R(x, \theta_0, h) \equiv p_{\theta_0+h}^{1/2}(x) - p_{\theta_0}^{1/2}(x) - \langle \eta(x, \theta_0), h\rangle = o(|h|).$$

To obtain a weaker, more generally applicable condition, we will not require 
$R^2(x, \theta_0, h) = o(|h|^2)$ for every $x$, but we will impose the condition that 
$R^2(X, \theta_0, h)$ averaged with respect to $\mu$ is $o(|h|^2)$. Let $L^2(\mu)$ denote the space 
of functions $g$ such that $\int g^2(x)\,d\mu(x) < \infty$. The convenience of working with 
square roots of densities is due in large part to the fact that $p_\theta^{1/2}(\cdot) \in L^2(\mu)$, a 
fact first exploited by Le Cam; see Pollard (1997) for an explanation. The desired 
smoothness condition is now given by the following definition. 


Definition 12.2.1 The family $\{P_\theta, \theta \in \Omega\}$ is quadratic mean differentiable 
(abbreviated q.m.d.) at $\theta_0$ if there exists a vector of real-valued functions 
$\eta(\cdot, \theta_0) = (\eta_1(\cdot, \theta_0), \ldots, \eta_k(\cdot, \theta_0))^T$ such that 

$$\int_{\mathcal{X}} \left[\sqrt{p_{\theta_0+h}(x)} - \sqrt{p_{\theta_0}(x)} - \langle \eta(x, \theta_0), h\rangle\right]^2 d\mu(x) = o(|h|^2) \tag{12.5}$$

as $|h| \to 0$. [1] 


The vector-valued function $\eta(\cdot, \theta_0)$ will be called the quadratic mean derivative 
of $P_\theta$ at $\theta_0$. Clearly, $\eta(x, \theta_0)$ is not unique since it can be changed on a set of $x$ 
values having $\mu$-measure zero. If q.m.d. holds at all $\theta_0$, then we say the family is 
q.m.d. 


[1] The definition of q.m.d. is a special case of Fréchet differentiability of the map 
$\theta \mapsto p_\theta^{1/2}(\cdot)$ from $\Omega$ to $L^2(\mu)$. 
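The following are useful facts about q.m.d. families. 

Lemma 12.2.1 Assume $\{P_\theta, \theta \in \Omega\}$ is q.m.d. at $\theta_0$. Let $h \in \mathbb{R}^k$. 

(i) Under $P_{\theta_0}$, $\langle \eta(X, \theta_0)/p_{\theta_0}^{1/2}(X), h\rangle$ is a random variable with mean 0; i.e., satisfying 

$$\int p_{\theta_0}^{1/2}(x)\langle \eta(x, \theta_0), h\rangle\,d\mu(x) = 0.$$

(ii) The components of $\eta(\cdot, \theta_0)$ are in $L^2(\mu)$; that is, for $i = 1, \ldots, k$, 

$$\int \eta_i^2(x, \theta_0)\,d\mu(x) < \infty.$$

Proof. In the definition of q.m.d., replace $h$ by $hn^{-1/2}$ to deduce that 

$$\int \left\{ n^{1/2}\left[p_{\theta_0+hn^{-1/2}}^{1/2}(x) - p_{\theta_0}^{1/2}(x)\right] - \langle \eta(x, \theta_0), h\rangle \right\}^2 d\mu(x) \to 0$$

as $n \to \infty$. But, if $\int (g_n - g)^2\,d\mu \to 0$ and $\int g_n^2\,d\mu < \infty$, then $\int g^2\,d\mu < \infty$ 
(Problem 12.3). Hence, for any $h \in \mathbb{R}^k$, $\langle \eta(x, \theta_0), h\rangle \in L^2(\mu)$. Taking $h$ equal 
to the vector of zeros except for a 1 in the $i$th component yields (ii). Also, if 
$\int (g_n - g)^2\,d\mu \to 0$ and $\int p^2\,d\mu < \infty$, then $\int pg_n\,d\mu \to \int pg\,d\mu$ (Problem 12.4). 
Taking $p = p_{\theta_0}^{1/2}$ and $g_n = n^{1/2}[p_{\theta_0+hn^{-1/2}}^{1/2}(x) - p_{\theta_0}^{1/2}(x)]$ yields 

$$\int p_{\theta_0}^{1/2}(x)\langle \eta(x, \theta_0), h\rangle\,d\mu(x) = \lim_{n\to\infty} n^{1/2}\int p_{\theta_0}^{1/2}(x)\left[p_{\theta_0+hn^{-1/2}}^{1/2}(x) - p_{\theta_0}^{1/2}(x)\right]d\mu(x)$$

$$= \lim_{n\to\infty} n^{1/2}\left[\int p_{\theta_0}^{1/2}(x)\,p_{\theta_0+hn^{-1/2}}^{1/2}(x)\,d\mu(x) - 1\right]$$

$$= -\frac{1}{2}\lim_{n\to\infty} n^{-1/2}\, n\int \left[p_{\theta_0}^{1/2}(x) - p_{\theta_0+hn^{-1/2}}^{1/2}(x)\right]^2 d\mu(x).$$

But, 

$$n\int \left[p_{\theta_0}^{1/2}(x) - p_{\theta_0+hn^{-1/2}}^{1/2}(x)\right]^2 d\mu(x) \to \int |\langle \eta(x, \theta_0), h\rangle|^2\,d\mu(x) < \infty, \tag{12.6}$$

and (i) follows. ■ 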

Note that Lemma 12.2.1 (i) asserts that the finite-dimensional set of vectors 
$\{\langle \eta(\cdot, \theta_0), h\rangle, h \in \mathbb{R}^k\}$ in $L^2(\mu)$ is orthogonal to $p_{\theta_0}^{1/2}(\cdot)$. 

It turns out that, when q.m.d. holds, the integrals of products of the components 
of $\eta(\cdot, \theta)$ play a vital role in the theory of asymptotic efficiency. Such values 
(multiplied by 4 for convenience) are gathered into a matrix, which we call the Fisher 
Information matrix. The use of the term information is justified by Problem 12.5. 
Definition 12.2.2 For a q.m.d. family with derivative $\eta(\cdot, \theta)$, define the Fisher 
Information matrix to be the matrix $I(\theta)$ with $(i,j)$ entry 

$$I_{i,j}(\theta) = 4\int \eta_i(x, \theta)\eta_j(x, \theta)\,d\mu(x).$$

The existence of $I(\theta)$ follows from Lemma 12.2.1 (ii) and the Cauchy-Schwarz 
inequality. Furthermore, $I(\theta)$ does not depend on the choice of dominating 
measure $\mu$ (Problem 12.8). 

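Lemma 12.2.2 For any $h \in \mathbb{R}^k$, 

$$\int |\langle h, \eta(x, \theta_0)\rangle|^2\,d\mu(x) = \frac{1}{4}\langle h, I(\theta_0)h\rangle.$$

Proof. Of course 

$$\langle h, \eta(x, \theta_0)\rangle = \sum_{i=1}^k h_i\eta_i(x, \theta_0).$$

Square it and integrate. ■ 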


Next, we would like to determine simple sufficient conditions for q.m.d. to 
hold. Assuming that the pointwise derivative of $p_\theta(x)$ with respect to $\theta$ exists, 
one would expect that the quadratic mean derivative $\eta(\cdot, \theta_0)$ is given by 

$$\eta(x, \theta) = \frac{1}{2}\,\frac{\dot p_\theta(x)}{p_\theta^{1/2}(x)}. \tag{12.7}$$

In fact, Hájek (1972) gave sufficient conditions where this is the case, and the 
following result for the case $k = 1$ is based on his argument. 


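Theorem 12.2.1 Suppose $\Omega$ is an open subset of $\mathbb{R}$ and fix $\theta_0 \in \Omega$. Assume 
$p_\theta^{1/2}(x)$ is an absolutely continuous function of $\theta$ in some neighborhood of $\theta_0$, for 
$\mu$-almost all $x$. [2] Also, assume for $\mu$-almost all $x$, the derivative $p'_\theta(x)$ of $p_\theta(x)$ 
with respect to $\theta$ exists at $\theta = \theta_0$. Define 

$$\eta(x, \theta) = \frac{p'_\theta(x)}{2p_\theta^{1/2}(x)} \tag{12.8}$$

if $p_\theta(x) > 0$ and $p'_\theta(x)$ exists and define $\eta(x, \theta) = 0$ otherwise. Also, assume the 
Fisher Information $I(\theta)$ is finite and continuous in $\theta$ at $\theta_0$. Then, $\{P_\theta\}$ is q.m.d. 
at $\theta_0$ with quadratic mean derivative $\eta(\cdot, \theta_0)$. 

Proof. If $p_\theta(x) > 0$ and $p'_\theta(x)$ exists, then from standard calculus it follows that 

$$\frac{d}{d\theta}\,p_\theta^{1/2}(x) = \eta(x, \theta).$$

Also, if $p_\theta(x) = 0$ and $p'_\theta(x)$ exists, then $p'_\theta(x) = 0$ (since $p_\theta(\cdot)$ is nonnegative). 
Now, if $x$ is such that $p_\theta^{1/2}(x)$ is absolutely continuous in $[\theta_0, \theta_0 + \delta]$, then 

$$\left\{\frac{1}{\delta}\left[p_{\theta_0+\delta}^{1/2}(x) - p_{\theta_0}^{1/2}(x)\right]\right\}^2 = \left[\frac{1}{\delta}\int_0^\delta \eta(x, \theta_0 + \lambda)\,d\lambda\right]^2 \le \frac{1}{\delta}\int_0^\delta \eta^2(x, \theta_0 + \lambda)\,d\lambda.$$

Integrating over all $x$ with respect to $\mu$ yields 

$$\int \left\{\frac{1}{\delta}\left[p_{\theta_0+\delta}^{1/2}(x) - p_{\theta_0}^{1/2}(x)\right]\right\}^2 d\mu(x) \le \frac{1}{4\delta}\int_0^\delta I(\theta_0 + \lambda)\,d\lambda.$$

By continuity of $I(\theta)$ at $\theta_0$, the right hand side tends to 

$$\frac{1}{4}I(\theta_0) = \int \eta^2(x, \theta_0)\,d\mu(x)$$

as $\delta \to 0$. But, for $\mu$-almost all $x$, 

$$\frac{1}{\delta}\left[p_{\theta_0+\delta}^{1/2}(x) - p_{\theta_0}^{1/2}(x)\right] \to \eta(x, \theta_0).$$

The result now follows by Vitali’s Theorem (Corollary 2.2.1). ■ 


[2] A real-valued function $g$ defined on an interval $[a, b]$ is absolutely continuous if 
$g(\theta) = g(a) + \int_a^\theta h(x)\,dx$ for some integrable function $h$ and all $\theta \in [a, b]$; Problem 
2 on p.182 of Dudley (1989) clarifies the relationship between this notion of absolute 
continuity of a function and the general notion of a measure being absolutely continuous 
with respect to another measure, as defined in Section 2.2. 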


Corollary 12.2.1 Suppose $\mu$ is Lebesgue measure on $\mathbb{R}$ and that $p_\theta(x) = f(x - \theta)$ 
is a location model, where $f^{1/2}(\cdot)$ is absolutely continuous. Let 

$$\eta(x, \theta) = \frac{-f'(x - \theta)}{2f^{1/2}(x - \theta)}$$

if $f(x - \theta) > 0$ and $f'(x - \theta)$ exists; otherwise, define $\eta(x, \theta) = 0$. Also, let 

$$I = 4\int_{-\infty}^{\infty} \eta^2(x, 0)\,dx$$

and assume $I < \infty$. Then, the family is q.m.d. at $\theta_0$ with quadratic mean 
derivative $\eta(x, \theta_0)$ and constant Fisher Information $I$. 


The assumption that $f^{1/2}$ is absolutely continuous can be replaced by the 
assumption that $f$ is absolutely continuous; see Hájek (1972), Lemma A.1. For 
other conditions, see Le Cam and Yang (2000), Section 7.3. 


Example 12.2.3 (Cauchy Location Model) The previous corollary applies 
to the Cauchy location model, where $p_\theta(x) = f(x - \theta)$ and $f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}$, and 
$I(\theta) = 1/2$ (Problem 12.9). ■ 


Example 12.2.4 (Double Exponential Location Model) Consider the 
location model $p_\theta(x) = f(x - \theta)$ where $f(x) = \frac{1}{2}\exp(-|x|)$. Although $f(\cdot)$ is 
not differentiable at 0, the corollary shows the family is q.m.d. Also, $I(\theta) = 1$ 
(Problem 12.9). ■ 
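
The Fisher Information values quoted in Examples 12.2.3 and 12.2.4 can be checked numerically from the formula $I = 4\int \eta^2(x, 0)\,dx$ of Corollary 12.2.1. The sketch below is illustrative only: the truncation of the real line to $[-200, 200]$, the grid spacing, and the hand-rolled trapezoid rule are assumptions, and the derivatives $f'$ are supplied analytically.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoid rule on a grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def fisher_information_location(f, f_prime, grid):
    """I = 4 * integral of eta(x, 0)^2, with eta = -f' / (2 sqrt(f)) where f > 0."""
    fx = f(grid)
    eta = np.where(fx > 0, -f_prime(grid) / (2.0 * np.sqrt(fx)), 0.0)
    return 4.0 * trapezoid(eta ** 2, grid)

grid = np.linspace(-200.0, 200.0, 400_001)  # spacing 0.001

cauchy_info = fisher_information_location(
    lambda x: 1.0 / (np.pi * (1.0 + x ** 2)),
    lambda x: -2.0 * x / (np.pi * (1.0 + x ** 2) ** 2),
    grid,
)
laplace_info = fisher_information_location(
    lambda x: 0.5 * np.exp(-np.abs(x)),
    # f is not differentiable at 0; sign(0) = 0 sets eta(0) = 0 there.
    lambda x: -0.5 * np.sign(x) * np.exp(-np.abs(x)),
    grid,
)
```

Up to discretization and truncation error, `cauchy_info` should be close to 1/2 and `laplace_info` close to 1, in agreement with the two examples.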


Example 12.2.5 Consider the location model $p_\theta(x) = f(x - \theta)$, where 

$$f(x) = C(\beta)\exp\{-|x|^\beta\},$$

where $\beta$ is a fixed positive constant and $C(\beta)$ is a normalizing constant. By the 
previous corollary, this family is q.m.d. if $\beta > \frac{1}{2}$. In fact, one can check that 

$$\int \frac{[f'(x)]^2}{f(x)}\,dx < \infty$$

if and only if $\beta > \frac{1}{2}$ (Problem 12.10). This suggests that q.m.d. fails if $\beta \le \frac{1}{2}$, 
which is the case; see Rao (1968) or Le Cam and Yang (2000), pp.188-190. ■ 

In the $k$-dimensional case, sufficient conditions for a family to be q.m.d. in 
terms of “ordinary” differentiation can be obtained by an argument similar to 
the proof of Theorem 12.2.1. As an example, we state the following (Problem 
12.11, or Bickel, Klaassen, Ritov and Wellner (1993), Proposition 2.1). 

Theorem 12.2.2 Suppose $\Omega$ is an open subset of $\mathbb{R}^k$, and $P_\theta$ has density $p_\theta(\cdot)$ 
with respect to a measure $\mu$. Assume $p_\theta(x)$ is continuously differentiable in $\theta$ for 
$\mu$-almost all $x$, with gradient vector $\dot p_\theta(x)$ (of dimension $1 \times k$). Let 

$$\eta(x, \theta) = \frac{\dot p_\theta(x)}{2p_\theta^{1/2}(x)} \tag{12.9}$$

if $p_\theta(x) > 0$ and $\dot p_\theta(x)$ exists, and set $\eta(x, \theta) = 0$ otherwise. Assume the Fisher 
Information matrix $I(\theta)$ exists and is continuous in $\theta$. Then, the family is q.m.d. 
with derivative $\eta(x, \theta)$. 



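Example 12.2.6 (Exponential Families in Natural Form) Suppose 

$$\frac{dP_\theta}{d\mu}(x) = p_\theta(x) = C(\theta)\exp[\langle \theta, T(x)\rangle],$$

where 

$$\Omega = \mathrm{int}\left\{\theta \in \mathbb{R}^k : \int \exp[\langle \theta, T(x)\rangle]\,d\mu(x) < \infty\right\}$$

and $T(x) = (T_1(x), \ldots, T_k(x))^T$ is a Borel vector-valued function on the space $\mathcal{X}$ 
where $\mu$ is defined. This family is q.m.d. ■ 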


Example 12.2.7 (Three-parameter Lognormal Family) Suppose $P_\theta$ is the 
distribution of $\gamma + \exp(X)$, where $X \sim N(\mu, \sigma^2)$. Here, $\theta = (\gamma, \mu, \sigma)$, where $\gamma$ and 
$\mu$ may take on any real value and $\sigma$ any positive value. Note the support of the 
distribution varies with $\theta$. Theorem 12.2.2 yields that this family is q.m.d., even 
though the likelihood function is unbounded. ■ 


Example 12.2.8 (Uniform Family) Suppose $P_\theta$ is the uniform distribution 
on $[0, \theta]$. This family is not q.m.d., which can be seen by the fact that the 
convergence (12.6) fails for any choice of $\eta$. Indeed, for $h > 0$, 

$$n\int_{\mathbb{R}} \left[p_{\theta_0}^{1/2}(x) - p_{\theta_0+hn^{-1/2}}^{1/2}(x)\right]^2 dx \ge n\int_{\theta_0}^{\theta_0+hn^{-1/2}} \frac{1}{\theta_0 + hn^{-1/2}}\,dx = \frac{hn^{1/2}}{\theta_0 + hn^{-1/2}} \to \infty.$$

In fact, it is quite typical that families whose support depends on unknown 
parameters will not be q.m.d., though Example 12.2.7 is an exception. ■ 
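
The failure of (12.6) in the uniform family can be contrasted numerically with a family where it holds. The following sketch uses the closed forms of $\int \sqrt{pq}\,dx$ (the Hellinger affinity) for two uniform distributions and for two unit-variance normal distributions; the function names and the choice of the $N(\theta, 1)$ comparison are assumptions made for illustration.

```python
import math

def scaled_sq_distance_uniform(n, theta0, h):
    """n * int (sqrt p_{theta0} - sqrt p_{theta1})^2 dx for U[0, theta], h > 0."""
    theta1 = theta0 + h / math.sqrt(n)
    affinity = math.sqrt(theta0 / theta1)  # int sqrt(p_{theta0} p_{theta1}) dx
    return n * (2.0 - 2.0 * affinity)

def scaled_sq_distance_normal(n, h):
    """Same quantity for N(theta, 1), where the affinity is exp(-delta^2 / 8)."""
    delta = h / math.sqrt(n)
    return n * (-2.0 * math.expm1(-delta ** 2 / 8.0))

uniform_values = [scaled_sq_distance_uniform(n, 1.0, 1.0) for n in (10**2, 10**4, 10**6)]
normal_values = [scaled_sq_distance_normal(n, 2.0) for n in (10**2, 10**4, 10**6)]
```

The uniform values grow like $hn^{1/2}/\theta_0$, while the normal values settle at $h^2 I/4 = 1$ for $h = 2$ and $I = 1$, as Lemma 12.2.2 and (12.6) predict.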
We are now in a position to obtain an asymptotic expansion of the loglikelihood 
ratio whose asymptotic form corresponds to that of the normal location model 
in Example 12.2.1. First, define the score function (or score vector) $\tilde\eta(x, \theta)$ by 

$$\tilde\eta(x, \theta) = \frac{2\eta(x, \theta)}{p_\theta^{1/2}(x)} \tag{12.10}$$

if $p_\theta(x) > 0$ and $\tilde\eta(x, \theta) = 0$ otherwise. Under the conditions of Theorem 12.2.2, 
$\tilde\eta(x, \theta)$ can often be computed as the gradient vector of $\log p_\theta(x)$. Also, define the 
normalized score vector $Z_n$ by 

$$Z_n = Z_{n,\theta_0} = n^{-1/2}\sum_{i=1}^n \tilde\eta(X_i, \theta_0). \tag{12.11}$$


The following theorem, due to Le Cam, is the main result of this section. 


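Theorem 12.2.3 Suppose $\{P_\theta, \theta \in \Omega\}$ is q.m.d. at $\theta_0$ with derivative $\eta(\cdot, \theta_0)$ and 
$\Omega$ is an open subset of $\mathbb{R}^k$. Suppose $I(\theta_0)$ is nonsingular. Fix $\theta_0$ and consider the 
likelihood ratio $L_{n,h}$ defined by 

$$L_{n,h} = \frac{L_n(\theta_0 + hn^{-1/2})}{L_n(\theta_0)} = \prod_{i=1}^n \frac{p_{\theta_0+hn^{-1/2}}(X_i)}{p_{\theta_0}(X_i)}, \tag{12.12}$$

where the likelihood function $L_n(\cdot)$ is defined in (12.1). 

(i) Then, as $n \to \infty$, 

$$\log(L_{n,h}) - \left[\langle h, Z_n\rangle - \frac{1}{2}\langle h, I(\theta_0)h\rangle\right] = o_{P^n_{\theta_0}}(1). \tag{12.13}$$

(ii) Under $P^n_{\theta_0}$, $Z_n \xrightarrow{d} N(0, I(\theta_0))$ and so 

$$\log(L_{n,h}) \xrightarrow{d} N\left(-\frac{1}{2}\langle h, I(\theta_0)h\rangle,\ \langle h, I(\theta_0)h\rangle\right). \tag{12.14}$$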


Proof. Consider the triangular array $Y_{n,1}, \ldots, Y_{n,n}$, where 

$$Y_{n,i} = \frac{p_{\theta_0+hn^{-1/2}}^{1/2}(X_i)}{p_{\theta_0}^{1/2}(X_i)} - 1.$$

Note that $E_{\theta_0}(Y_{n,i}^2) \le 2 < \infty$ and 

$$\log(L_{n,h}) = 2\sum_{i=1}^n \log(1 + Y_{n,i}). \tag{12.15}$$

But, 

$$\log(1 + y) = y - \frac{1}{2}y^2 + y^2r(y),$$

where $r(y) \to 0$ as $y \to 0$, so that 

$$\log(L_{n,h}) = 2\sum_{i=1}^n Y_{n,i} - \sum_{i=1}^n Y_{n,i}^2 + 2\sum_{i=1}^n Y_{n,i}^2\,r(Y_{n,i}).$$

The idea of expanding the likelihood ratio in terms of variables involving square 
roots of densities is known as Le Cam’s square root trick; see Le Cam (1969). 
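The proof of (i) will follow from the following four convergence results: 

$$\sum_{i=1}^n E_{\theta_0}(Y_{n,i}) \to -\frac{1}{8}\langle h, I(\theta_0)h\rangle \tag{12.16}$$

$$\sum_{i=1}^n \left[Y_{n,i} - E_{\theta_0}(Y_{n,i})\right] - \frac{1}{2}\langle h, Z_n\rangle \xrightarrow{P^n_{\theta_0}} 0 \tag{12.17}$$

$$\sum_{i=1}^n Y_{n,i}^2 \xrightarrow{P^n_{\theta_0}} \frac{1}{4}\langle h, I(\theta_0)h\rangle \tag{12.18}$$

$$\sum_{i=1}^n Y_{n,i}^2\,r(Y_{n,i}) \xrightarrow{P^n_{\theta_0}} 0. \tag{12.19}$$

Once these four convergences have been established, part (ii) of the theorem 
follows by the Central Limit Theorem and the facts that 

$$E_{\theta_0}[\langle \tilde\eta(X_i, \theta_0), h\rangle] = 0 \quad \text{by Lemma 12.2.1 (i)}$$

and 

$$\mathrm{Var}_{\theta_0}[\langle \tilde\eta(X_i, \theta_0), h\rangle] = \langle h, I(\theta_0)h\rangle \quad \text{by Lemma 12.2.2.}$$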

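(a) To show (12.16), 

$$\sum_{i=1}^n E_{\theta_0}(Y_{n,i}) = n\int \left[\frac{p_{\theta_0+hn^{-1/2}}^{1/2}(x)}{p_{\theta_0}^{1/2}(x)} - 1\right]p_{\theta_0}(x)\,d\mu(x)$$

$$= -\frac{n}{2}\int \left[p_{\theta_0+hn^{-1/2}}^{1/2}(x) - p_{\theta_0}^{1/2}(x)\right]^2 d\mu(x) \to -\frac{1}{2}\int |\langle \eta(x, \theta_0), h\rangle|^2\,d\mu(x)$$

by (12.6). This last expression is equal to $-\frac{1}{8}\langle h, I(\theta_0)h\rangle$ by Lemma 12.2.2, and 
(12.16) follows. 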

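(b) To show (12.17), write 

$$Y_{n,i} = \frac{1}{2}n^{-1/2}\langle h, \tilde\eta(X_i, \theta_0)\rangle + n^{-1/2}\,\frac{R_n(X_i)}{p_{\theta_0}^{1/2}(X_i)}, \tag{12.20}$$

where $\int R_n^2(x)\,d\mu(x) \to 0$ (by q.m.d.). Hence, 

$$\sum_{i=1}^n \left[Y_{n,i} - E_{\theta_0}(Y_{n,i})\right] = \frac{1}{2}\langle h, Z_n\rangle + n^{-1/2}\sum_{i=1}^n \left[\frac{R_n(X_i)}{p_{\theta_0}^{1/2}(X_i)} - E_{\theta_0}\left(\frac{R_n(X_i)}{p_{\theta_0}^{1/2}(X_i)}\right)\right].$$

The last term, under $P^n_{\theta_0}$, has mean 0 and variance bounded by 

$$E_{\theta_0}\left[\frac{R_n^2(X_i)}{p_{\theta_0}(X_i)}\right] = \int R_n^2(x)\,d\mu(x) \to 0.$$

So, (12.17) follows. 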

(c) To prove (12.18), by the Weak Law of Large Numbers, under $\theta_0$, 

$$\frac{1}{n}\sum_{i=1}^n \left[\langle h, \tilde\eta(X_i, \theta_0)\rangle\right]^2 \xrightarrow{P} E_{\theta_0}\left\{\left[\langle h, \tilde\eta(X_1, \theta_0)\rangle\right]^2\right\} = \langle h, I(\theta_0)h\rangle. \tag{12.21}$$

Now using equation (12.20), we get 

$$\sum_{i=1}^n Y_{n,i}^2 = \frac{1}{4n}\sum_{i=1}^n \left[\langle h, \tilde\eta(X_i, \theta_0)\rangle\right]^2 + \frac{1}{n}\sum_{i=1}^n \frac{R_n^2(X_i)}{p_{\theta_0}(X_i)} + \frac{1}{n}\sum_{i=1}^n \langle h, \tilde\eta(X_i, \theta_0)\rangle\,\frac{R_n(X_i)}{p_{\theta_0}^{1/2}(X_i)}. \tag{12.22}$$

By (12.21), the first term converges in probability under $\theta_0$ to $\frac{1}{4}\langle h, I(\theta_0)h\rangle$. The 
second term is nonnegative and has expectation under $\theta_0$ equal to 

$$\int R_n^2(x)\,\mu(dx) \to 0;$$

hence, the second term goes to 0 in probability under $P^n_{\theta_0}$ by Markov’s inequality. 
The last term goes to 0 in probability under $P^n_{\theta_0}$ by the Cauchy-Schwarz inequality 
and the convergences of the first two terms. Thus, (12.18) follows. By taking 
expectations in (12.22), a similar argument shows 

$$nE_{\theta_0}(Y_{n,i}^2) = \frac{1}{4}\langle h, I(\theta_0)h\rangle + o(1) \tag{12.23}$$

as $n \to \infty$, which also implies $E_{\theta_0}(Y_{n,i}) \to 0$. 

(d) Finally, to prove (12.19), note that 

$$\left|\sum_{i=1}^n Y_{n,i}^2\,r(Y_{n,i})\right| \le \max_{1 \le i \le n}|r(Y_{n,i})|\sum_{i=1}^n Y_{n,i}^2.$$

So, it suffices to show $\max_{1 \le i \le n}|r(Y_{n,i})| \to 0$ in probability under $\theta_0$, which follows 
if we can show 

$$\max_{1 \le i \le n}|Y_{n,i}| \xrightarrow{P^n_{\theta_0}} 0. \tag{12.24}$$

But, $\sum_{i=1}^n [Y_{n,i} - E_{\theta_0}(Y_{n,i})]$ is asymptotically normal by (12.17) and the Central 
Limit Theorem. Hence, Corollary 11.2.2 is applicable with $s_n^2 = O(1)$, which 
yields the Lindeberg Condition 

$$nE_{\theta_0}\left[|Y_{n,i} - E_{\theta_0}(Y_{n,i})|^2\,I\{|Y_{n,i} - E_{\theta_0}(Y_{n,i})| \ge \epsilon\}\right] \to 0 \tag{12.25}$$

for any $\epsilon > 0$. But then, 

$$P_{\theta_0}\left\{\max_{1 \le i \le n}|Y_{n,i} - E_{\theta_0}(Y_{n,i})| > \epsilon\right\} \le nP_{\theta_0}\left\{|Y_{n,i} - E_{\theta_0}(Y_{n,i})|^2 > \epsilon^2\right\},$$

which can be bounded by the expression on the left side of (12.25) divided by 
$\epsilon^2$, and so $\max_{1 \le i \le n}|Y_{n,i} - E_{\theta_0}(Y_{n,i})| \to 0$ in probability under $\theta_0$. The result 
(12.24) follows, since $E_{\theta_0}(Y_{n,i}) \to 0$. ■ 
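
The expansion (12.13) can be watched at work in a model that is q.m.d. but not an exponential family. The sketch below simulates the Cauchy location model of Example 12.2.3, for which the score is $\tilde\eta(x, \theta) = 2(x - \theta)/[1 + (x - \theta)^2]$ and $I(\theta) = 1/2$, and returns the difference between the exact loglikelihood ratio and $hZ_n - \frac{1}{2}h^2 I(\theta_0)$. The sample size, seed, and function name are illustrative assumptions.

```python
import numpy as np

def cauchy_lan_remainder(x, theta0, h):
    """log L_{n,h} minus [h * Z_n - (1/2) h^2 I(theta0)] in the Cauchy location model."""
    n = x.size
    u = x - theta0
    delta = h / np.sqrt(n)
    # log f(u - delta) - log f(u) with f(u) = 1 / (pi * (1 + u^2))
    log_lr = np.sum(np.log1p(u ** 2) - np.log1p((u - delta) ** 2))
    score = 2.0 * u / (1.0 + u ** 2)  # gradient of log p_theta(x) at theta0
    z_n = score.sum() / np.sqrt(n)
    fisher = 0.5
    return log_lr - (h * z_n - 0.5 * h ** 2 * fisher)

rng = np.random.default_rng(0)
x = rng.standard_cauchy(20_000) + 3.0  # data generated under theta0 = 3
remainder = cauchy_lan_remainder(x, 3.0, 1.0)
```

With $n = 20{,}000$ the remainder is typically of order $10^{-2}$ or smaller, consistent with the $o_{P^n_{\theta_0}}(1)$ term in (12.13); this is a single simulated draw, not a proof.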


Remark 12.2.1 Since the theorem concerns the local behavior of the likelihood 
ratio near $\theta_0$, it is not entirely necessary to assume $\Omega$ is open. However, it is 
important to assume $\theta_0$ is an interior point; see Problem 12.14. 

Remark 12.2.2 The theorem holds if $h$ is replaced by $h_n$ on the left side of 
each part of the theorem, where $h_n \to h$. It then follows that the left side of 
(12.13) tends to 0 in probability uniformly in $h$ as long as $h$ varies in a compact 
set; that is, for any $c > 0$, the supremum over $h$ such that $|h| \le c$ of the absolute 
value of the left side of (12.13) tends to 0 in probability under $\theta_0$; see Problem 
13.12. 


12.3 Contiguity 

Contiguity is an asymptotic form of a probability measure $Q$ being absolutely 
continuous with respect to another probability measure $P$. In order to motivate 
the concept, suppose $P$ and $Q$ are two probability measures on some measurable 
space $(\mathcal{X}, \mathcal{F})$. Assume that $Q$ is absolutely continuous with respect to $P$. This 
means that $E \in \mathcal{F}$ and $P(E) = 0$ implies $Q(E) = 0$. 

Suppose $T = T(X)$ is a random vector from $\mathcal{X}$ to $\mathbb{R}^k$, such as an estimator, 
test statistic, or test function. How can one compute the distribution of $T$ under 
$Q$ if one knows how to compute probabilities or expectations under $P$? Specifically, 
suppose it is required to compute $E_Q[f(T)]$, where $f$ is some measurable function 
from $\mathbb{R}^k$ to $\mathbb{R}$. Let $p$ and $q$ denote the densities of $P$ and $Q$ with respect to a 
common measure $\mu$. Then, assuming $Q$ is absolutely continuous with respect to 
$P$, 

$$E_Q[f(T(X))] = \int_{\mathcal{X}} f(T(x))\,dQ(x) \tag{12.26}$$

$$= \int_{\mathcal{X}} f(T(x))\,\frac{q(x)}{p(x)}\,p(x)\,d\mu(x) = E_P[f(T(X))L(X)], \tag{12.27}$$

where $L(X)$ is the usual likelihood ratio statistic: 

$$L(X) = \frac{q(X)}{p(X)}. \tag{12.28}$$

Hence, the distribution of $T(X)$ under $Q$ can be computed if the joint distribution 
of $(T(X), L(X))$ under $P$ is known. Let $F^{T,L}$ denote the joint distribution of 
$(T(X), L(X))$ under $P$. Then, by taking $f$ to be the indicator function $f(T(X)) = 
I_B[T(X)]$ defined to be equal to one if $T(X)$ falls in $B$ and equal to zero otherwise, 
we obtain: 

$$Q\{T(X) \in B\} = \int_{\mathcal{X}} I(T(x) \in B)L(x)p(x)\,\mu(dx) \tag{12.29}$$

$$= E_P[I(T(X) \in B)L(X)] = \int_{B \times \mathbb{R}} r\,dF^{T,L}(t, r). \tag{12.30}$$

Thus, under absolute continuity of $Q$ with respect to $P$, the distribution of 
$T(X)$ under $Q$ can in principle be obtained from the joint 
distribution of $T(X)$ and $L(X)$ under $P$. 

More generally, if $f = f(t, r)$ is a function from $\mathbb{R}^k \times \mathbb{R}$ to $\mathbb{R}$, 

$$E_Q[f(T(X), L(X))] = \int_{\mathbb{R}^k \times \mathbb{R}} f(t, r)\,r\,dF^{T,L}(t, r) \tag{12.31}$$

(Problem 12.18). 
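
The change-of-measure identity (12.27) is easy to verify on a finite sample space, where both sides are finite sums. The following is a minimal sketch under that assumption; the function name and the three-point distributions are hypothetical, and counting measure plays the role of $\mu$.

```python
import numpy as np

def expectation_under_q(f_values, p, q):
    """Compute E_Q[f] as E_P[f * L], where L = q / p on the support of P."""
    f_values = np.asarray(f_values, dtype=float)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # The identity requires Q to be absolutely continuous with respect to P.
    if np.any((p == 0) & (q > 0)):
        raise ValueError("Q is not absolutely continuous with respect to P")
    support = p > 0
    likelihood_ratio = q[support] / p[support]
    return float(np.sum(f_values[support] * likelihood_ratio * p[support]))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
f_values = [1.0, 4.0, 9.0]
via_p = expectation_under_q(f_values, p, q)
direct = float(np.dot(f_values, q))  # E_Q[f] computed directly
```

When $Q$ puts mass where $P$ has none, the function raises an error rather than returning a wrong answer; handling that case asymptotically is exactly what contiguity is designed for.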

Contiguity is an asymptotic version of absolute continuity that permits an 
analogous asymptotic statement. Consider sequences of pairs of probabilities 
$\{P_n, Q_n\}$, where $P_n$ and $Q_n$ are probabilities on some measurable space $(\mathcal{X}_n, \mathcal{F}_n)$. 
Let $T_n$ be some random vector from $\mathcal{X}_n$ to $\mathbb{R}^k$. Suppose the asymptotic 
distribution of $T_n$ under $P_n$ is easily obtained, but the behavior of $T_n$ under $Q_n$ is also 
required. For example, if $T_n$ represents a test function for testing $P_n$ versus $Q_n$, 
the power of $T_n$ is the expectation of $T_n$ under $Q_n$. Contiguity provides a means 
of performing the required calculation. An example may help fix ideas. 


Example 12.3.1 (The Wilcoxon Signed Rank Statistic) Let $X_1, \ldots, X_n$ be 
i.i.d. real-valued random variables with common density $f(\cdot)$. Assume that $f(\cdot)$ 
is symmetric about $\theta$. The problem is to test the null hypothesis that $\theta = 0$ 
against the alternative hypothesis that $\theta > 0$. Consider the Wilcoxon signed rank 
statistic defined by: 

$$W_n = W_n(X_1, \ldots, X_n) = n^{-3/2}\sum_{i=1}^n R^+_{i,n}\,\mathrm{sign}(X_i), \tag{12.32}$$

where $\mathrm{sign}(X_i)$ is 1 if $X_i > 0$ and is $-1$ otherwise, and $R^+_{i,n}$ is the rank of $|X_i|$ 
among $|X_1|, \ldots, |X_n|$. Under the null hypothesis, the behavior of $W_n$ is fairly easy 
to obtain. If $\theta = 0$, the variables $\mathrm{sign}(X_i)$ are i.i.d., each 1 or $-1$ with probability 
1/2, and are independent of the variables $R^+_{i,n}$. Hence, $E_{\theta=0}(W_n) = 0$. Define 
$I_k$ to be 1 if the $k$th largest $|X_i|$ corresponds to a positive observation and $-1$ 
otherwise. Then, we have 

$$\mathrm{Var}_{\theta=0}(W_n) = n^{-3}\,\mathrm{Var}\left(\sum_{k=1}^n kI_k\right) \tag{12.33}$$

$$= n^{-3}\sum_{k=1}^n k^2 = n^{-3}\,\frac{n(n+1)(2n+1)}{6} \to \frac{1}{3} \tag{12.34}$$

as $n \to \infty$. Not surprisingly, $W_n \xrightarrow{d} N(0, \frac{1}{3})$. To see why, note that 
(Problem 12.19) 

$$W_n - n^{-1/2}\sum_{i=1}^n U_i\,\mathrm{sign}(X_i) = o_P(1), \tag{12.35}$$

where $U_i = G(|X_i|)$ and $G$ is the c.d.f. of $|X_i|$. But, under the null hypothesis, 
$U_i$ and $\mathrm{sign}(X_i)$ are independent. Moreover, the random variables $U_i\,\mathrm{sign}(X_i)$ are 
i.i.d., and so the Central Limit Theorem is applicable. Thus, $W_n$ is asymptotically 
normal with mean 0 and variance 1/3, and this is true whenever the underlying 
distribution has a symmetric density about 0. Indeed, the exact distribution of 
$W_n$ is the same for all distributions symmetric about 0. Hence, the test that 
rejects the null hypothesis if $W_n$ exceeds $3^{-1/2}z_{1-\alpha}$ has limiting level $\alpha$. Of 
course, for finite $n$, critical values for $W_n$ can be obtained exactly. Suppose now 
that we want to approximate the power of this test. The above argument does 
not generalize to even close alternatives since it heavily uses the fact that the 
variables are symmetric about zero. Contiguity provides a fairly simple means of 
attacking this problem, and we will reconsider this example later. ■ 
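
The null behavior of $W_n$ described above can be illustrated by simulation. The sketch below computes $W_n$ as in (12.32), evaluates the exact null variance from (12.33)-(12.34), and compares it with the sample variance over repeated draws. The choice of a $t_3$ distribution as the symmetric null distribution, the sample size, the number of replications, and the function names are all assumptions for illustration; ties are assumed absent, which holds with probability one for continuous data.

```python
import numpy as np

def wilcoxon_w(x):
    """W_n = n^{-3/2} * sum of (rank of |X_i|) * sign(X_i), assuming no ties."""
    n = x.size
    ranks = np.argsort(np.argsort(np.abs(x))) + 1  # ranks of |X_i| among |X_1|,...,|X_n|
    signs = np.where(x > 0, 1.0, -1.0)
    return float(np.sum(ranks * signs) / n ** 1.5)

def exact_null_variance(n):
    """n^{-3} * n(n+1)(2n+1)/6, which tends to 1/3."""
    return n * (n + 1) * (2 * n + 1) / (6.0 * n ** 3)

rng = np.random.default_rng(1)
n, reps = 50, 4000
draws = np.array([wilcoxon_w(rng.standard_t(3, size=n)) for _ in range(reps)])
simulated_mean = float(draws.mean())
simulated_variance = float(draws.var())
```

Because the exact null distribution of $W_n$ is the same for every distribution symmetric about 0, replacing the $t_3$ draws by any other continuous symmetric distribution should give the same variance up to Monte Carlo error.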
We now return to the general setup. 

Definition 12.3.1 Let $P_n$ and $Q_n$ be probability distributions on $(\mathcal{X}_n, \mathcal{F}_n)$. 
The sequence $\{Q_n\}$ is contiguous to the sequence $\{P_n\}$ if $P_n(E_n) \to 0$ implies 
$Q_n(E_n) \to 0$ for every sequence $\{E_n\}$ with $E_n \in \mathcal{F}_n$. 

The following equivalent definition is sometimes useful. The sequence $\{Q_n\}$ is 
contiguous to $\{P_n\}$ if for every sequence of real-valued random variables $T_n$ such 
that $T_n \to 0$ in $P_n$-probability we also have $T_n \to 0$ in $Q_n$-probability. 

If $\{Q_n\}$ is contiguous to $\{P_n\}$ and $\{P_n\}$ is contiguous to $\{Q_n\}$, then we say 
the sequences $\{P_n\}$ and $\{Q_n\}$ are mutually contiguous, or just contiguous. 

Example 12.3.2 Suppose $P_n$ is the standard normal distribution $N(0, 1)$ and 
$Q_n$ is $N(\xi_n, 1)$. Unless $\xi_n$ is bounded, $P_n$ and $Q_n$ cannot be contiguous. Indeed, 
suppose $\xi_n \to \infty$ and consider $E_n = \{x : |x - \xi_n| < 1\}$. Then, $Q_n(E_n) \approx 0.68$ 
for all $n$, but $P_n(E_n) \to 0$. Note that, regardless of the values of $\xi_n$, $P_n$ and $Q_n$ 
are mutually absolutely continuous for every $n$. ■ 
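
The two probabilities in Example 12.3.2 have closed forms in terms of the standard normal c.d.f., so the failure of contiguity can be tabulated directly. This is a small sketch; the helper names and the sample values of $\xi_n$ are assumptions.

```python
import math

def std_normal_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probs_of_E_n(xi):
    """(P_n(E_n), Q_n(E_n)) for E_n = {x : |x - xi| < 1}, P_n = N(0,1), Q_n = N(xi,1)."""
    p = std_normal_cdf(xi + 1.0) - std_normal_cdf(xi - 1.0)
    q = std_normal_cdf(1.0) - std_normal_cdf(-1.0)
    return p, q

table = {xi: probs_of_E_n(xi) for xi in (0.0, 2.0, 5.0, 10.0)}
```

As $\xi_n$ grows, the first coordinate collapses to 0 while the second stays at about 0.68, which is the sequence of events $E_n$ that rules out contiguity of $\{Q_n\}$ to $\{P_n\}$.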

Example 12.3.3 Suppose $P_n$ is the joint distribution of $n$ i.i.d. observations 
$X_1, \ldots, X_n$ from $N(0, 1)$ and $Q_n$ is the joint distribution of $n$ i.i.d. observations 
from $N(\xi_n, 1)$. Unless $\xi_n \to 0$, $P_n$ and $Q_n$ cannot be contiguous. For example, 
suppose $\xi_n > \epsilon > 0$ for all large $n$. Let $\bar X_n = n^{-1}\sum_{i=1}^n X_i$ and consider $E_n = 
\{\bar X_n > \epsilon/2\}$. By the law of large numbers, $P_n(E_n) \to 0$ but $Q_n(E_n) \to 1$. As will 
be seen shortly, in order for $P_n$ and $Q_n$ to be contiguous, it will be necessary and 
sufficient for $\xi_n \to 0$ in such a way that $n^{1/2}\xi_n$ remains bounded. ■ 

We now would like a useful means of determining whether or not $Q_n$ is 
contiguous to $P_n$. Suppose $P_n$ and $Q_n$ have densities $p_n$ and $q_n$ with respect to $\mu_n$. 
For $x \in \mathcal{X}_n$, define the likelihood ratio of $Q_n$ with respect to $P_n$ by 

$$L_n(x) = \begin{cases} q_n(x)/p_n(x) & \text{if } p_n(x) > 0 \\ \infty & \text{if } p_n(x) = 0 < q_n(x) \\ 1 & \text{if } p_n(x) = q_n(x) = 0. \end{cases} \tag{12.36}$$

Under $P_n$ or $Q_n$, the event $\{p_n = q_n = 0\}$ has probability 0, so it really doesn’t 
matter how $L_n$ is defined in this case (as long as it is measurable). Note that $L_n$ 
is regarded as an extended random variable, which means it is allowed to take on 
the value $\infty$, at least under $Q_n$. Of course, under $P_n$, $L_n$ is finite with probability 
one. 

Observe that 

$$E_{P_n}(L_n) = \int_{\mathcal{X}_n} L_n(x)p_n(x)\,\mu_n(dx) = \int_{\{x:\,p_n(x) > 0\}} q_n(x)\,\mu_n(dx)$$

$$= Q_n\{x : p_n(x) > 0\} = 1 - Q_n\{x : p_n(x) = 0\} \le 1, \tag{12.37}$$

with equality if and only if $Q_n$ is absolutely continuous with respect to $P_n$. 

Example 12.3.4 (Contiguous but not absolutely continuous sequence) 
Suppose $P_n$ is uniformly distributed on $[0, 1]$ and $Q_n$ is uniformly distributed on 
$[0, \theta_n]$, where $\theta_n > 1$. Then, $Q_n$ is not absolutely continuous with respect to $P_n$. 
Note that the likelihood ratio $L_n$ is equal to $1/\theta_n$ with probability one under $P_n$, 
and so 

$$E_{P_n}(L_n) = \frac{1}{\theta_n} < 1.$$

It will follow from Theorem 12.3.1 that $Q_n$ is contiguous to $P_n$ if $\theta_n \to 1$. 

The notation $\mathcal{L}(T|P)$ refers to the distribution of a random variable (or possibly 
an extended random variable) $T = T(X)$ when $X$ is governed by $P$. Let $G_n = 
\mathcal{L}(L_n|P_n)$, the distribution of the likelihood ratio under $P_n$. Note that $G_n$ is a 
tight sequence, because by Markov’s inequality, 

$$P_n\{L_n > c\} \le \frac{E_{P_n}(L_n)}{c} \le \frac{1}{c}, \tag{12.38}$$

where the last inequality follows from (12.37). 

The statement that $E_{P_n}(L_n) = 1$ implies that $Q_n$ is absolutely continuous 
with respect to $P_n$, by (12.37). The following result, known as Le Cam’s First 
Lemma, may be regarded as an asymptotic version of this statement. 


Theorem 12.3.1 Given $P_n$ and $Q_n$, consider the likelihood ratio $L_n$ defined in 
(12.36). Let $G_n$ denote the distribution of $L_n$ under $P_n$. Suppose $G_n$ converges 
weakly to a distribution $G$. If $G$ has mean 1, then $Q_n$ is contiguous to $P_n$. 


Proof. Suppose $P_n(E_n) = \alpha_n \to 0$. Let $\phi_n$ be a most powerful level $\alpha_n$ test of 
$P_n$ versus $Q_n$. By the Neyman-Pearson Lemma, the test is of the form 

$$\phi_n = \begin{cases} 1 & \text{if } L_n > k_n \\ 0 & \text{if } L_n < k_n, \end{cases} \tag{12.39}$$

for some $k_n$ chosen so the test is level $\alpha_n$. Since $\phi_n$ is at least as powerful as the 
test that has rejection region $E_n$, 

$$Q_n\{E_n\} \le \int \phi_n\,dQ_n,$$

so it suffices to show the right side tends to zero. Now, for any $y < \infty$, 

$$\int \phi_n\,dQ_n = \int_{L_n \le y} \phi_n\,dQ_n + \int_{L_n > y} \phi_n\,dQ_n$$

$$\le y\int \phi_n\,dP_n + \int_{L_n > y} dQ_n \le y\int \phi_n\,dP_n + 1 - \int_{L_n \le y} dQ_n$$

$$= y\alpha_n + 1 - \int_{L_n \le y} L_n\,dP_n = y\alpha_n + 1 - \int_0^y x\,dG_n(x).$$

Fix any $\epsilon > 0$ and take $y$ to be a continuity point of $G$ with 

$$\int_0^y x\,dG(x) > 1 - \frac{\epsilon}{2},$$

which is possible since $G$ has mean 1. But $G_n$ converges weakly to $G$ implies 

$$\int_0^y x\,dG_n(x) \to \int_0^y x\,dG(x) \tag{12.40}$$

by an argument like that in Example 11.2.14 (Problem 12.27). Thus, for 
sufficiently large $n$, 

$$1 - \int_0^y x\,dG_n(x) < \frac{\epsilon}{2}$$

and $y\alpha_n < \epsilon/2$. It follows that, for sufficiently large $n$, 

$$\int \phi_n\,dQ_n < \epsilon,$$

as was to be proved. ■ 

The following result summarizes some equivalent characterizations of 
contiguity. The notation $\mathcal{L}(T|P)$ refers to the distribution (or law) of a random variable 
$T$ under $P$. 


Theorem 12.3.2 The following are equivalent characterizations of $\{Q_n\}$ being 
contiguous to $\{P_n\}$. 

(i) For every sequence of real-valued random variables $T_n$ such that $T_n \to 0$ in 
$P_n$-probability, it also follows that $T_n \to 0$ in $Q_n$-probability. 

(ii) For every sequence $T_n$ such that $\mathcal{L}(T_n|P_n)$ is tight, it also follows that 
$\mathcal{L}(T_n|Q_n)$ is tight. 

(iii) If $G$ is any limit point [3] of $\mathcal{L}(L_n|P_n)$, then $G$ has mean 1. 

Proof. First, we show that (ii) implies (i). Suppose $T_n \to 0$ in $P_n$-probability; 
that is, $P_n\{|T_n| > \delta\} \to 0$ for every $\delta > 0$. Then, there exists $\epsilon_n \downarrow 0$ such that 
$P_n\{|T_n| > \epsilon_n\} \to 0$. So, $|T_n|/\epsilon_n$ is tight under $\{P_n\}$. By hypothesis, $|T_n|/\epsilon_n$ is 
also tight under $\{Q_n\}$. Assume the conclusion that $T_n \to 0$ in $Q_n$-probability 
fails; then, one could find $\epsilon > 0$ such that $Q_n\{|T_n| > \epsilon\} > \epsilon$ for infinitely many 
$n$. Then, of course, $Q_n\{|T_n| > \sqrt{\epsilon_n}\} > \epsilon$ for infinitely many $n$. Since $1/\sqrt{\epsilon_n} \uparrow \infty$, 
it follows that $|T_n|/\epsilon_n$ cannot be tight under $\{Q_n\}$, which is a contradiction. 

Conversely, to show that (i) implies (ii), assume that $\mathcal{L}(T_n|P_n)$ is tight. Then, 
given $\epsilon > 0$, there exists $k$ such that $P_n\{|T_n| > k\} \le \epsilon/2$ for all $n$. If $\mathcal{L}(T_n|Q_n)$ is 
not tight, then for every $j$, $Q_n\{|T_n| > j\} > \epsilon$ for some $n$. That is, there exists a 
subsequence $n_j$ such that $Q_{n_j}\{|T_{n_j}| > j\} > \epsilon$ for every $j$. As soon as $j > k$, 

$$P_{n_j}\{|T_{n_j}| > j\} \le P_{n_j}\{|T_{n_j}| > k\} \le \frac{\epsilon}{2},$$

a contradiction. 

To show (iii) implies (i), first recall (12.38), which implies $G_n$ is tight. Assuming 
$P_n\{A_n\} \to 0$, we must show $Q_n\{A_n\} \to 0$. Assume that this is not the case. 
Then, there exists a subsequence $n_j$ and $\epsilon > 0$ such that $Q_{n_j}\{A_{n_j}\} \ge \epsilon$ for all $n_j$. 
But, there exists a further subsequence $n_{j_k}$ such that $G_{n_{j_k}}$ converges to some $G$. 
Assuming (iii), $G$ has mean 1. By Theorem 12.3.1, $\{Q_{n_{j_k}}\}$ is contiguous to $\{P_{n_{j_k}}\}$. 
Since $P_{n_{j_k}}\{A_{n_{j_k}}\} \to 0$, this forces $Q_{n_{j_k}}\{A_{n_{j_k}}\} \to 0$, which is a contradiction. 


[3] $G$ is a limit point of a sequence $G_n$ of distributions if $G_{n_j}$ converges in distribution 
to $G$ for some subsequence $n_j$. 
Conversely, suppose (i) and that $G_n$ converges weakly to $G$ (or apply the 
following argument to any convergent subsequence). By Example 11.2.14, it follows 
that 

$$\int x\,dG(x) \le \liminf_n E_{P_n}(L_n) \le 1,$$

so it suffices to show $\int x\,dG(x) \ge 1$. Let $t$ be a continuity point of $G$. Then, also 
by Example 11.2.14 (specifically (11.39)), 

$$\int x\,dG(x) \ge \int_{\{x \le t\}} x\,dG(x) = \lim_n E_{P_n}(L_n I\{L_n \le t\}) = \lim_n Q_n\{L_n \le t\}.$$

So, it suffices to show that, given any $\epsilon > 0$, there exists a $t$ such that $Q_n\{L_n > 
t\} < \epsilon$ for all large $n$. If this fails, then for every $j$, there exists $n_j$ such that 
$Q_{n_j}\{L_{n_j} > j\} > \epsilon$. But, by (12.38), 

$$P_{n_j}\{L_{n_j} > j\} \le \frac{1}{j} \to 0$$

as $j \to \infty$, which would contradict (i). ■ 

As will be seen in many important examples, loglikelihood ratios are typically 
asymptotically normally distributed, and the following corollary is useful. 

Corollary 12.3.1 Consider a sequence $\{P_n, Q_n\}$ with likelihood ratio $L_n$ defined
in (12.36). Assume

$$\mathcal{L}(L_n \mid P_n) \xrightarrow{d} \mathcal{L}(e^Z), \tag{12.41}$$

where $Z$ is distributed as $N(\mu, \sigma^2)$. Then, $Q_n$ and $P_n$ are mutually contiguous if
and only if $\mu = -\sigma^2/2$.

Proof. To show $Q_n$ is contiguous to $P_n$, apply part (iii) of Theorem 12.3.2
by showing $E(e^Z) = 1$. But, recalling the characteristic function of $Z$ from
equation (11.10), it follows that

$$E(e^Z) = \exp(\mu + \tfrac{1}{2}\sigma^2),$$

which equals 1 if and only if $\mu = -\sigma^2/2$. The converse follows by Problem 12.23. ■
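The mean condition in Corollary 12.3.1 is easy to check numerically. The following is a minimal sketch, not part of the original text; the function names are hypothetical and chosen only for illustration. It evaluates $E(e^Z) = \exp(\mu + \sigma^2/2)$ and tests whether a given normal limit for $\log(L_n)$ satisfies $\mu = -\sigma^2/2$.

```python
import math


def mean_of_exp_normal(mu: float, sigma2: float) -> float:
    """Return E(e^Z) for Z ~ N(mu, sigma2), the mean of the limit law of L_n."""
    return math.exp(mu + 0.5 * sigma2)


def satisfies_mean_condition(mu: float, sigma2: float, tol: float = 1e-12) -> bool:
    """True when the limit law e^Z has mean 1, i.e. mu = -sigma2 / 2."""
    return abs(mean_of_exp_normal(mu, sigma2) - 1.0) <= tol


# Illustrative values (assumptions): a limit N(-xi^2/2, xi^2) with xi = 1.5
# satisfies the condition, while N(0, 1) does not.
xi = 1.5
print(satisfies_mean_condition(-xi**2 / 2, xi**2))
print(satisfies_mean_condition(0.0, 1.0))
```

The first call corresponds to a mean and variance linked by $\mu = -\sigma^2/2$; the second shows a normal limit for which the mean of $e^Z$ exceeds 1.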
We may write (12.41) equivalently as

$$\mathcal{L}(\log(L_n) \mid P_n) \xrightarrow{d} \mathcal{L}(Z).$$

However, since $P_n\{L_n = 0\}$ may be positive, we may have $\log(L_n) = -\infty$ with
positive probability, in which case $\log(L_n)$ is regarded as an extended real-valued
random variable taking values in $\mathbb{R} \cup \{\pm\infty\}$. If $X_n$ is an extended real-valued
random variable and $X$ is a real-valued random variable with c.d.f. $F$, we say (as
in Definition 11.2.1) $X_n$ converges in distribution to $X$ if

$$P_n\{X_n \in (-\infty, t]\} \to F(t)$$

whenever $t$ is a continuity point of $F$. It follows that if $X_n$ converges in distribution
to a random variable that is finite (with probability one), then the probability
that $X_n$ is finite must tend to 1.

Example 12.3.5 (Example 12.3.2, continued). Again, suppose that $P_n = N(0,1)$
and $Q_n = N(\xi_n, 1)$. In this case,

$$L_n = L_n(X) = \exp\bigl(\xi_n X - \tfrac{1}{2}\xi_n^2\bigr).$$

Thus,

$$\mathcal{L}(\log(L_n) \mid P_n) = N\bigl(-\tfrac{1}{2}\xi_n^2,\ \xi_n^2\bigr).$$

Such a sequence of distributions will converge weakly along a subsequence $n_j$ if
and only if $\xi_{n_j} \to \xi$ (for some $|\xi| < \infty$), in which case, the limiting distribution is
$N(-\xi^2/2, \xi^2)$ and the relationship between the mean and the variance ($\mu = -\sigma^2/2$)
is satisfied. Hence, $Q_n$ is contiguous to $P_n$ if and only if $\xi_n$ is bounded. ■


Example 12.3.6 (Example 12.3.3, continued). Suppose $X_1, \ldots, X_n$ are
i.i.d. with common distribution $N(\xi, 1)$. Let $P_n$ represent the joint distribution
when $\xi = 0$ and let $Q_n$ represent the joint distribution when $\xi = \xi_n$. Then,

$$\log(L_n(X_1, \ldots, X_n)) = \xi_n \sum_{i=1}^n X_i - \frac{n\xi_n^2}{2}, \tag{12.42}$$

and so

$$\mathcal{L}(\log(L_n) \mid P_n) = N\Bigl(-\frac{n\xi_n^2}{2},\ n\xi_n^2\Bigr).$$

By an argument similar to that of the previous example, $Q_n$ is contiguous to $P_n$ if
and only if $n\xi_n^2$ remains bounded, i.e. $\xi_n = O(n^{-1/2})$. Note that, even if $\xi_n \to 0$,
but at a rate slower than $n^{-1/2}$, $Q_n$ is not contiguous to $P_n$. This is related
to the assertion that the problem of testing $P_n$ versus $Q_n$ is degenerate unless
$\xi_n \asymp n^{-1/2}$, in the sense that the most powerful level $\alpha$ test $\phi_n$ has asymptotic
power satisfying $E_{\xi_n}(\phi_n) \to 1$ if $n^{1/2}|\xi_n| \to \infty$ and $E_{\xi_n}(\phi_n) \to \alpha$ if $n^{1/2}\xi_n \to 0$.$^4$
Indeed, suppose without loss of generality that $\xi_n > 0$. Then, the most powerful
level $\alpha$ test rejects when $n^{1/2}\bar X_n > z_{1-\alpha}$, where $\bar X_n = \sum_{i=1}^n X_i/n$ and $z_{1-\alpha}$
denotes the $1 - \alpha$ quantile of the standard normal distribution. The power of $\phi_n$
against $\xi_n$ is then

$$P_{\xi_n}\{n^{1/2}\bar X_n > z_{1-\alpha}\} = P_{\xi_n}\{n^{1/2}(\bar X_n - \xi_n) > z_{1-\alpha} - n^{1/2}\xi_n\} = P\{Z > z_{1-\alpha} - n^{1/2}\xi_n\},$$

where $Z$ is a standard normal variable. Clearly, the last expression tends to 1 if
and only if $n^{1/2}\xi_n \to \infty$; furthermore, it tends to $\alpha$ if and only if $n^{1/2}\xi_n \to 0$.
The limiting power is bounded away from $\alpha$ and 1 if and only if $\xi_n \asymp n^{-1/2}$. ■
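The three regimes in Example 12.3.6 can be seen directly from the exact power formula $P\{Z > z_{1-\alpha} - n^{1/2}\xi_n\}$. The sketch below is an illustration only; the function name `mp_test_power` and the particular rates tried are assumptions made for this example.

```python
from statistics import NormalDist


def mp_test_power(n: int, xi_n: float, alpha: float = 0.05) -> float:
    """Power P{Z > z_{1-alpha} - sqrt(n) * xi_n} of the most powerful level-alpha test."""
    std = NormalDist()
    z = std.inv_cdf(1 - alpha)
    return 1 - std.cdf(z - n**0.5 * xi_n)


for n in (10**2, 10**4, 10**6):
    slow = mp_test_power(n, n**-0.25)   # xi_n -> 0 slower than n^{-1/2}
    local = mp_test_power(n, n**-0.5)   # xi_n of exact order n^{-1/2}
    fast = mp_test_power(n, 1.0 / n)    # xi_n -> 0 faster than n^{-1/2}
    print(n, slow, local, fast)
```

Only the middle column stays strictly between $\alpha$ and 1 as $n$ grows; the first tends to 1 and the last to $\alpha$, which is the degeneracy described above.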


Example 12.3.7 (Q.m.d. families) Let $\{P_\theta, \theta \in \Omega\}$ with $\Omega$ an open subset
of $\mathbb{R}^k$ be q.m.d., with corresponding densities $p_\theta(\cdot)$. By Theorem 12.2.3, under


$^4$ Two real-valued sequences $\{a_n\}$ and $\{b_n\}$ are said to be of the same order, written
$a_n \asymp b_n$, if $|a_n/b_n|$ is bounded away from 0 and $\infty$.





So, 

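$$\log\Bigl(\frac{dP^n_{\theta_0 + hn^{-1/2}}}{dP^n_{\theta_0}}\Bigr) = n^{-1/2}\sum_{i=1}^n \langle h, \tilde\eta(X_i, \theta_0)\rangle - \frac{1}{2}\langle h, I(\theta_0)h\rangle + o_{P^n_{\theta_0}}(1), \tag{12.43}$$

where $\tilde\eta(x, \theta) = 2\eta(x, \theta)/p_\theta^{1/2}(x)$, $\eta(\cdot, \theta)$ is the quadratic mean derivative at $\theta$,
and $I(\theta)$ is the Information matrix at $\theta$. Hence, by Corollary 12.3.1, $P^n_{\theta_0 + hn^{-1/2}}$
and $P^n_{\theta_0}$ are mutually contiguous. ■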


Suppose $Q_n$ is contiguous to $P_n$. As before, let $L_n$ be the likelihood ratio
defined by (12.28). Let $T_n$ be an arbitrary sequence of real-valued statistics. The
following theorem allows us to determine the asymptotic behavior of $(T_n, L_n)$
under $Q_n$ from the behavior of $(T_n, L_n)$ under $P_n$.

Theorem 12.3.3 Suppose $Q_n$ is contiguous to $P_n$. Let $T_n$ be a sequence of real-valued
random variables. Suppose, under $P_n$, $(T_n, L_n)$ converges in distribution to
a limit law $F(\cdot, \cdot)$; that is, for any bounded continuous function $f$ on
$(-\infty, \infty) \times [0, \infty)$,

$$E_{P_n}[f(T_n, L_n)] \to \int\!\!\int f(t, r)\,dF(t, r). \tag{12.44}$$

Then, the limiting distribution of $(T_n, L_n)$ under $Q_n$ has density $r\,dF(t, r)$; that
is,

$$E_{Q_n}[f(T_n, L_n)] \to \int\!\!\int f(t, r)\,r\,dF(t, r) \tag{12.45}$$

for any bounded continuous $f$. Equivalently, if under $P_n$, $(T_n, \log(L_n))$ converges
weakly to a limit law $\bar F(\cdot, \cdot)$, then

$$E_{Q_n}[f(T_n, \log(L_n))] \to \int\!\!\int f(t, r)\,e^r\,d\bar F(t, r) \tag{12.46}$$

for any bounded continuous $f$.

Note that equation (12.45) is simply an asymptotic version of (12.31). 

Remark 12.3.1 The result is also true if $T_n$ is vector-valued, and the proof is
the same.

Proof. Let $F_n = \mathcal{L}((T_n, L_n) \mid P_n)$ and $G_n = \mathcal{L}((T_n, L_n) \mid Q_n)$. Since $L_n$ converges
in distribution under $P_n$, contiguity and Theorem 12.3.2 (iii) imply that

$$\int\!\!\int r\,dF(t, r) = 1.$$

Thus, $r\,dF(t, r)$ defines a probability distribution on $(-\infty, \infty) \times [0, \infty)$.

Let $f$ be a nonnegative, continuous function on $(-\infty, \infty) \times [0, \infty]$. By the
Portmanteau Theorem (Theorem 11.2.1 (vi)), it suffices to show that

$$\liminf_n \int\!\!\int f(t, r)\,dG_n(t, r) \ge \int\!\!\int f(t, r)\,r\,dF(t, r).$$

Note that

$$\int\!\!\int f(t, r)\,dG_n(t, r) = E_{Q_n}[f(T_n, L_n)] = \int f(T_n, L_n)\,dQ_n$$

$$\ge \int_{\{p_n > 0\}} f(T_n, L_n)\,dQ_n = \int f(T_n, L_n) L_n\,dP_n = \int\!\!\int f(t, r)\,r\,dF_n(t, r).$$

So, it suffices to show

$$\liminf_n \int\!\!\int f(t, r)\,r\,dF_n(t, r) \ge \int\!\!\int r f(t, r)\,dF(t, r).$$

But, $r f(t, r)$ is a nonnegative, continuous function, and so the result follows again
by the Portmanteau Theorem. ■


The following special case is often referred to as Le Cam’s Third Lemma. 


Corollary 12.3.2 Assume that, under $P_n$, $(T_n, \log(L_n)) \xrightarrow{d} (T, Z)$, where $(T, Z)$
is bivariate normal with $E(T) = \mu_1$, $Var(T) = \sigma_1^2$, $E(Z) = \mu_2$, $Var(Z) = \sigma_2^2$ and
$Cov(T, Z) = \sigma_{1,2}$. Assume $\mu_2 = -\sigma_2^2/2$, so that $Q_n$ is contiguous to $P_n$. Then,
under $Q_n$, $T_n$ is asymptotically normal:

$$\mathcal{L}(T_n \mid Q_n) \xrightarrow{d} N(\mu_1 + \sigma_{1,2},\ \sigma_1^2).$$

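Proof. Let $\bar F(\cdot, \cdot)$ denote the bivariate normal distribution of $(T, Z)$. By Theorem
12.3.3, the limiting distribution of $\mathcal{L}(T_n \mid Q_n)$ has density $e^r\,d\bar F(x, r)$; let $\tilde T$
denote a random variable having this distribution. The characteristic function of
$\tilde T$ is given by:

$$E(e^{i\lambda\tilde T}) = \int\!\!\int e^{i\lambda x} e^r\,d\bar F(x, r) = E(e^{i\lambda T + Z}), \tag{12.47}$$

which is the characteristic function of $(T, Z)$ evaluated at $t = (t_1, t_2)^T = (\lambda, -i)^T$.
By Example 11.2.1, this is given by

$$\exp\bigl(i\langle\mu, t\rangle - \tfrac{1}{2}\langle\Sigma t, t\rangle\bigr) = \exp\bigl(i\mu_1\lambda + \mu_2 - \tfrac{1}{2}\langle\Sigma(\lambda, -i)^T, (\lambda, -i)^T\rangle\bigr)$$

$$= \exp\bigl(i\mu_1\lambda + \mu_2 - \tfrac{1}{2}\lambda^2\sigma_1^2 + i\lambda\sigma_{1,2} + \tfrac{1}{2}\sigma_2^2\bigr) = \exp\bigl[i(\mu_1 + \sigma_{1,2})\lambda - \tfrac{1}{2}\lambda^2\sigma_1^2\bigr],$$

the last equality following from the fact that $\mu_2 = -\sigma_2^2/2$ (by contiguity). But,
this last expression is indeed the characteristic function of the normal distribution
with mean $\mu_1 + \sigma_{1,2}$ and variance $\sigma_1^2$. ■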


Example 12.3.8 (Asymptotically Linear Statistic) Let $\{P_\theta, \theta \in \Omega\}$ with
$\Omega$ an open subset of $\mathbb{R}^k$ be q.m.d., with corresponding densities $p_\theta(\cdot)$. Recall
Example 12.3.7, which shows that $P^n_{\theta_0 + hn^{-1/2}}$ and $P^n_{\theta_0}$ are mutually contiguous.
The expansion (12.43) shows a lot more. For example, suppose an estimator
(sequence) $\hat\theta_n$ is asymptotically linear in the following sense: under $\theta_0$,

$$n^{1/2}(\hat\theta_n - \theta_0) = n^{-1/2}\sum_{i=1}^n \psi_{\theta_0}(X_i) + o_{P^n_{\theta_0}}(1), \tag{12.48}$$

where $E_{\theta_0}[\psi_{\theta_0}(X_i)] = 0$ and $\tau^2 = Var_{\theta_0}[\psi_{\theta_0}(X_i)] < \infty$. Thus, under $\theta_0$,

$$n^{1/2}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, \tau^2).$$

Then, the joint behavior of $\hat\theta_n$ with the likelihood ratio satisfies

$$\Bigl(n^{1/2}(\hat\theta_n - \theta_0),\ \log\frac{dP^n_{\theta_0 + hn^{-1/2}}}{dP^n_{\theta_0}}\Bigr) \tag{12.49}$$

$$= n^{-1/2}\sum_{i=1}^n \bigl(\psi_{\theta_0}(X_i),\ \langle h, \tilde\eta(X_i, \theta_0)\rangle\bigr) + \bigl(0,\ -\tfrac{1}{2}\langle h, I(\theta_0)h\rangle\bigr) + o_{P^n_{\theta_0}}(1).$$

By the bivariate Central Limit Theorem, this converges under $\theta_0$ to a bivariate
normal distribution with covariance

$$\sigma_{1,2} = Cov_{\theta_0}\bigl(\psi_{\theta_0}(X_1),\ \langle h, \tilde\eta(X_1, \theta_0)\rangle\bigr). \tag{12.50}$$

Hence, under $P^n_{\theta_0 + hn^{-1/2}}$, $n^{1/2}(\hat\theta_n - \theta_0)$ converges in distribution to $N(\sigma_{1,2}, \tau^2)$,
by Corollary 12.3.2. It follows that, under $P^n_{\theta_0 + hn^{-1/2}}$,

$$n^{1/2}\bigl(\hat\theta_n - (\theta_0 + hn^{-1/2})\bigr) \xrightarrow{d} N(\sigma_{1,2} - h,\ \tau^2). \quad ■$$


Example 12.3.9 (t-statistic) Consider a location model $f(x - \theta)$ for which
$f(x)$ has mean 0 and variance $\sigma^2$, and which satisfies the assumptions of Corollary
12.2.1, which imply this family is q.m.d. For testing $\theta = \theta_0 = 0$, consider the
behavior of the usual t-statistic

$$t_n = \frac{n^{1/2}\bar X_n}{S_n} = \frac{n^{1/2}\bar X_n}{\sigma} + o_{P^n_{\theta_0}}(1).$$

Then, (12.48) holds with $\psi_{\theta_0}(X_i) = X_i/\sigma$. We seek the behavior of $t_n$ under
$\theta_n = h/n^{1/2}$. Although this can be obtained by direct means, let us obtain the
results by contiguity. Note that (12.43) holds with

$$\tilde\eta(X_i, \theta_0) = -\frac{f'(X_i)}{f(X_i)}.$$

Thus, $\sigma_{1,2}$ in (12.50) reduces to

$$\sigma_{1,2} = -h\,Cov_{\theta_0 = 0}\Bigl[\frac{X_i}{\sigma},\ \frac{f'(X_i)}{f(X_i)}\Bigr] = -\frac{h}{\sigma}\int_{-\infty}^{\infty} x f'(x)\,dx = \frac{h}{\sigma}.$$

Hence, under $\theta_n = h/n^{1/2}$,

$$t_n \xrightarrow{d} N\Bigl(\frac{h}{\sigma},\ 1\Bigr). \quad ■$$

<7 


Example 12.3.10 (Sign Test) As in the previous example, consider a location
model $f(x - \theta)$, where $f$ is a density with respect to Lebesgue measure. Assume
the conditions in Corollary 12.2.1, so that the family is q.m.d. Further suppose
that $f(x)$ is continuous at $x = 0$ and $P_{\theta = 0}\{X_i > 0\} = 1/2$. For testing $\theta = \theta_0 = 0$,
consider the (normalized) sign statistic

$$S_n = n^{-1/2}\sum_{i=1}^n \Bigl[I\{X_i > 0\} - \frac{1}{2}\Bigr],$$

where $I\{X_i > 0\}$ is one if $X_i > 0$ and is 0 otherwise. Then, (12.48) holds with
$\psi_0(X_i) = I\{X_i > 0\} - \frac{1}{2}$ and so

$$S_n \xrightarrow{d} N(0, \tfrac{1}{4}).$$

Under $\theta_n = h/n^{1/2}$, $S_n \xrightarrow{d} N(\sigma_{1,2}, 1/4)$, where $\sigma_{1,2}$ is given by (12.50) and equals

$$\sigma_{1,2} = -h\,Cov_0\Bigl[I\{X_i > 0\},\ \frac{f'(X_i)}{f(X_i)}\Bigr] = -h\int_0^\infty f'(x)\,dx = h f(0).$$

Hence, under $\theta_n = h/n^{1/2}$,

$$S_n \xrightarrow{d} N(h f(0),\ \tfrac{1}{4}).$$
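The limit for the sign statistic under local alternatives can be checked by simulation. The following sketch is not from the original text: it assumes standard normal errors, so that $f(0) = 1/\sqrt{2\pi}$, and the function name, sample size, number of replications, and seed are arbitrary choices made for illustration.

```python
import numpy as np


def simulate_sign_statistic(n: int, h: float, reps: int, seed: int = 0) -> np.ndarray:
    """Draw `reps` copies of S_n = n^{-1/2} * sum(I{X_i > 0} - 1/2) under theta_n = h / sqrt(n).

    The errors are standard normal here (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    theta_n = h / np.sqrt(n)
    x = rng.standard_normal((reps, n)) + theta_n
    return (np.sum(x > 0, axis=1) - n / 2) / np.sqrt(n)


draws = simulate_sign_statistic(n=400, h=1.0, reps=4000)
# Compare with the limiting mean h * f(0) = 1 / sqrt(2 * pi) and limiting variance 1/4.
print(draws.mean(), 1 / np.sqrt(2 * np.pi))
print(draws.var(), 0.25)
```

The empirical mean and variance should be close to $h f(0)$ and $1/4$, in agreement with the contiguity calculation; the agreement is only approximate for finite $n$.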


Example 12.3.11 (Example 12.3.1, continued). Recall the Wilcoxon signed
rank statistic $W_n$ given by (12.32). For illustration, suppose the underlying density
$f(\cdot)$ of the observations is normal with mean $\theta$ and variance 1. Under the null
hypothesis $\theta = 0$, $W_n$ is asymptotically normal $N(0, \frac{1}{3})$. The problem now is to
compute the asymptotic power against the sequence of alternatives $\theta_n = h/n^{1/2}$
for some $h > 0$. Under the null hypothesis, by (12.35) and (12.42),

$$(W_n, \log(L_n)) = \Bigl(n^{-1/2}\sum_{i=1}^n U_i\,\mathrm{sign}(X_i),\ hn^{-1/2}\sum_{i=1}^n X_i - \frac{h^2}{2}\Bigr) + o_{P_n}(1), \tag{12.51}$$

where $U_i = G(|X_i|)$ and $G$ is the c.d.f. of $|X_i|$. This last expression is
asymptotically bivariate normal with covariance under $\theta = 0$ equal to

$$\sigma_{1,2} = h\,Cov_0[G(|X_1|)\,\mathrm{sign}(X_1),\ X_1] = h\,E_0[G(|X_1|)\,|X_1|], \tag{12.52}$$

and thus $\sigma_{1,2}$ is equal to $h/\sqrt{\pi}$ (Problem 12.28). Hence, under $\theta_n = h/n^{1/2}$, $W_n$ is
asymptotically normal with mean $h/\sqrt{\pi}$ and variance 1/3. Thus, the asymptotic
power of the test that rejects when $W_n > 3^{-1/2} z_{1-\alpha}$ is

$$\lim_{n \to \infty} P_{\theta_n}\Bigl\{W_n - \frac{h}{\sqrt{\pi}} > 3^{-1/2} z_{1-\alpha} - \frac{h}{\sqrt{\pi}}\Bigr\} = 1 - \Phi\bigl(z_{1-\alpha} - (3/\pi)^{1/2} h\bigr),$$

where $\Phi(\cdot)$ is the standard normal c.d.f.

More generally, assume the underlying model is a location model $f(x - \theta)$,
where $f(x)$ is assumed symmetric about zero. Assume $f'(x)$ exists for Lebesgue
almost all $x$ and

$$0 < \int \frac{[f'(x)]^2}{f(x)}\,dx < \infty.$$

Then, by Corollary 12.2.1, this model is q.m.d. and (12.43) holds with

$$\tilde\eta(x, 0) = -\frac{f'(x)}{f(x)}.$$

Under the null hypothesis $\theta = 0$, $W_n \xrightarrow{d} N(0, 1/3)$, as in the normal case. Under
the sequence of alternatives $\theta_n = h/n^{1/2}$,

$$W_n \xrightarrow{d} N(\sigma_{1,2},\ \tfrac{1}{3}),$$

where $\sigma_{1,2}$ is given by (12.50). In this case,

$$\sigma_{1,2} = Cov_{\theta = 0}\Bigl[U\,\mathrm{sign}(X),\ -h\frac{f'(X)}{f(X)}\Bigr],$$

where $U = G(|X|)$ and $G$ is the c.d.f. of $|X|$ when $X$ has density $f(\cdot)$. So,
$G(x) = 2F(x) - 1$, where $F$ is the c.d.f. of $X$. By an integration by parts (see
Problem 12.29),

$$\sigma_{1,2} = -h\,E_{\theta = 0}\Bigl[G(|X|)\,\mathrm{sign}(X)\frac{f'(X)}{f(X)}\Bigr] = 2h\int_{-\infty}^{\infty} f^2(x)\,dx. \tag{12.53}$$

Thus, under $\theta_n = h/n^{1/2}$,

$$W_n \xrightarrow{d} N\Bigl(2h\int_{-\infty}^{\infty} f^2(x)\,dx,\ \frac{1}{3}\Bigr). \quad ■$$
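Formula (12.53) and the resulting asymptotic power are simple to evaluate numerically for a given density. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the integral is approximated by a trapezoidal sum on a finite grid (adequate for densities with light tails), and the power expression $1 - \Phi(z_{1-\alpha} - 3^{1/2}\sigma_{1,2})$ is obtained from the limit $N(\sigma_{1,2}, 1/3)$ for the test that rejects when $W_n > 3^{-1/2} z_{1-\alpha}$.

```python
import numpy as np
from statistics import NormalDist


def wilcoxon_shift(h, density, lo=-12.0, hi=12.0, m=200_001):
    """Approximate sigma_{1,2} = 2 h * integral of f^2 by a trapezoidal sum on [lo, hi]."""
    x = np.linspace(lo, hi, m)
    y = density(x) ** 2
    integral = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))
    return 2 * h * integral


def wilcoxon_power(h, density, alpha=0.05):
    """Limiting power 1 - Phi(z_{1-alpha} - sqrt(3) * sigma_{1,2}) of the signed rank test."""
    std = NormalDist()
    return 1 - std.cdf(std.inv_cdf(1 - alpha) - np.sqrt(3) * wilcoxon_shift(h, density))


def std_normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)


# For the standard normal density the shift should agree with h / sqrt(pi).
print(wilcoxon_shift(1.0, std_normal_pdf), 1 / np.sqrt(np.pi))
print(wilcoxon_power(1.0, std_normal_pdf))
```

In the normal case the first line reproduces the value $h/\sqrt{\pi}$ found from (12.52), so the two routes to $\sigma_{1,2}$ agree.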


Example 12.3.12 (Neyman-Pearson Statistic) Assume $\{P_\theta, \theta \in \Omega\}$ is
q.m.d. at $\theta_0$, where $\Omega$ is an open subset of $\mathbb{R}^k$ and $I(\theta_0)$ is nonsingular, so
that the assumptions behind Theorem 12.2.3 are in force. Let $p_\theta(\cdot)$ be the
corresponding density of $P_\theta$. Consider the likelihood ratio statistic based on $n$ i.i.d.
observations $X_1, \ldots, X_n$ given by

$$L_{n,h} = \frac{dP^n_{\theta_0 + hn^{-1/2}}}{dP^n_{\theta_0}} = \prod_{i=1}^n \frac{p_{\theta_0 + hn^{-1/2}}(X_i)}{p_{\theta_0}(X_i)}. \tag{12.54}$$

By Theorem 12.2.3, under $P^n_{\theta_0}$,

$$\log(L_{n,h}) \xrightarrow{d} N\Bigl(-\frac{\sigma_h^2}{2},\ \sigma_h^2\Bigr), \tag{12.55}$$

where $\sigma_h^2 = \langle h, I(\theta_0)h\rangle$. Apply Corollary 12.3.2 with $T_n = \log(L_{n,h})$, so that
$T = Z$ and $\sigma_{1,2} = \sigma_h^2$. Then, under $P^n_{\theta_0 + hn^{-1/2}}$, $\log(L_{n,h})$ is asymptotically
$N(\frac{\sigma_h^2}{2}, \sigma_h^2)$. Hence, the test that rejects when $\log(L_{n,h})$ exceeds $-\frac{1}{2}\sigma_h^2 + z_{1-\alpha}\sigma_h$
is asymptotically level $\alpha$ for testing $\theta = \theta_0$ versus $\theta = \theta_0 + hn^{-1/2}$, where $z_{1-\alpha}$
denotes the $1 - \alpha$ quantile of $N(0, 1)$. Then, the limiting power of this test sequence
for testing $\theta = \theta_0$ versus $\theta = \theta_0 + hn^{-1/2}$ is $1 - \Phi(z_{1-\alpha} - \sigma_h)$ (Problem 12.30). ■
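As a small numerical companion to Example 12.3.12, the limiting power $1 - \Phi(z_{1-\alpha} - \sigma_h)$ depends on the direction $h$ only through $\sigma_h^2 = \langle h, I(\theta_0)h\rangle$. The function below is a sketch with a hypothetical name; the information matrix passed to it is whatever the user supplies for the model at hand.

```python
import numpy as np
from statistics import NormalDist


def np_limiting_power(h, info, alpha=0.05):
    """Limiting power 1 - Phi(z_{1-alpha} - sigma_h), with sigma_h^2 = <h, I(theta_0) h>."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    info = np.atleast_2d(np.asarray(info, dtype=float))
    sigma_h = float(np.sqrt(h @ info @ h))
    std = NormalDist()
    return 1 - std.cdf(std.inv_cdf(1 - alpha) - sigma_h)


# Illustrative inputs (assumptions): identity information in two dimensions.
print(np_limiting_power([1.0, 0.0], np.eye(2)))
print(np_limiting_power([3.0, 4.0], np.eye(2)))
```

With $h = 0$ the function returns $\alpha$, as it must, and the power increases with $\sigma_h$.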


12.4 Likelihood Methods in Parametric Models 

The goal of this section is to study some classical large sample methods based on
the likelihood function. The classical likelihood ratio test, as well as the tests of
Wald and Rao, will be introduced, but optimality of these tests will be deferred
until the next chapter. Throughout this section, we will assume that $X_1, \ldots, X_n$
are i.i.d. with common distribution $P_\theta$, where $\theta \in \Omega$ and $\Omega$ is an open subset
of $\mathbb{R}^k$. We will also assume each $P_\theta$ is absolutely continuous with respect to a
common $\sigma$-finite measure $\mu$, so that $p_\theta$ denotes the density of $P_\theta$ with respect to
$\mu$. The likelihood function is defined by

$$L_n(\theta) = \prod_{i=1}^n p_\theta(X_i). \tag{12.56}$$

It is thus the (joint) probability density of the observations at fixed values of
$X_1, \ldots, X_n$, viewed as a function of $\theta$. Note that, for the sake of simplicity,
the dependence of $L_n(\theta)$ on $X_1, \ldots, X_n$ has been suppressed. (In the case that
$X_1, \ldots, X_n$ are not i.i.d., $L_n(\theta)$ is modified so that the joint density of the $X_i$'s
is used rather than the product of the marginal densities.)


12.4.1 Efficient Likelihood Estimation 

In preparation for the construction of reasonable large sample tests and confidence
regions, we begin by studying some efficient point estimators of $\theta$ which will serve
as a basis for such tests. If the likelihood $L_n(\theta)$ has a unique maximum $\hat\theta_n$, then
$\hat\theta_n$ is called the maximum likelihood estimator (MLE) of $\theta$. If, in addition, $L_n(\theta)$
is differentiable in $\theta$, $\hat\theta_n$ will be a solution of the likelihood equations

$$\frac{\partial}{\partial\theta_j}\log L_n(\theta) = 0, \qquad j = 1, \ldots, k.$$


Example 12.4.1 (Normal Family) Suppose $X_1, \ldots, X_n$ is an i.i.d. sample
from $N(\mu, \sigma^2)$, with both parameters unknown, so $\theta = (\mu, \sigma^2)^T$. In this case,
the log likelihood function is

$$\log L_n(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - n\log(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2,$$

and the likelihood equations reduce to

$$\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) = 0$$

and

$$-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i - \mu)^2 = 0.$$

These equations have a unique solution, given by the maximum likelihood estimator
$(\hat\mu_n, \hat\sigma_n^2)$, where $\hat\mu_n = \bar X_n$ is the usual sample mean and $\hat\sigma_n^2$ is the biased
version of the sample variance given by

$$\hat\sigma_n^2 = n^{-1}\sum_{i=1}^n (X_i - \bar X_n)^2$$

(Problem 12.35). By the weak law of large numbers, $\bar X_n \to \mu$ in probability;
by Example 11.2.6, $\hat\sigma_n^2 \to \sigma^2$ in probability as well. A direct argument easily
establishes the joint limiting distribution of the MLE. First note that

$$n^{1/2}\Bigl[\hat\sigma_n^2 - n^{-1}\sum_{i=1}^n (X_i - \mu)^2\Bigr] = -n^{1/2}(\bar X_n - \mu)^2 \xrightarrow{P} 0,$$

since $n^{1/2}(\bar X_n - \mu)$ is $N(0, \sigma^2)$ and $\bar X_n - \mu \xrightarrow{P} 0$. Hence, by Slutsky's Theorem,
$n^{1/2}((\bar X_n, \hat\sigma_n^2)^T - (\mu, \sigma^2)^T)$ has the same limiting distribution as

$$n^{1/2}\Bigl[\Bigl(\bar X_n,\ n^{-1}\sum_{i=1}^n (X_i - \mu)^2\Bigr)^T - (\mu, \sigma^2)^T\Bigr],$$

which by the multivariate CLT tends in distribution to $N(0, \Sigma)$, where $\Sigma$ is the
$2 \times 2$ diagonal matrix with $(i, j)$ entry $\sigma_{i,j}$ given by $\sigma_{1,1} = \sigma^2$ and
$\sigma_{2,2} = Var[(X_1 - \mu)^2] = 2\sigma^4$. In fact, $\Sigma = I^{-1}(\theta)$ in this case. ■
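The closed-form solution in Example 12.4.1 translates directly into code. The following minimal sketch uses hypothetical function names; it computes the MLE $(\bar X_n, \hat\sigma_n^2)$ with the biased variance and builds the limiting covariance matrix $\Sigma$ for a given value of $\sigma^2$.

```python
import numpy as np


def normal_mle(x):
    """MLE of (mu, sigma^2) for an i.i.d. normal sample: sample mean and biased variance."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # divides by n, not n - 1
    return mu_hat, sigma2_hat


def limiting_covariance(sigma2):
    """Sigma = diag(sigma^2, 2 sigma^4), the limiting covariance of n^{1/2}(MLE - theta)."""
    return np.diag([sigma2, 2.0 * sigma2**2])


mu_hat, sigma2_hat = normal_mle([1.0, 2.0, 3.0, 4.0])
print(mu_hat, sigma2_hat)
print(limiting_covariance(sigma2_hat))
```

Plugging $\hat\sigma_n^2$ into `limiting_covariance` gives a consistent estimate of $\Sigma$, which is what the Wald procedures later in this section require.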

Example 12.4.2 (MLE for a one-parameter exponential family) Suppose
$X_1, \ldots, X_n$ is an i.i.d. sample from a one-parameter exponential family with
common density with respect to a $\sigma$-finite measure $\mu$ given by

$$p_\theta(x) = \exp\{\theta T(x) - A(\theta)\}.$$

Here, $\theta$ is assumed to be an interior point of the natural parameter space. From
Problem 2.16, recall that $E_\theta[T(X_i)] = A'(\theta)$ and $Var_\theta[T(X_i)] = A''(\theta)$. To show
the maximum likelihood estimator is well-defined and to find an expression for
it, we examine the derivative of the log of $L_n(\theta)$, which is equal to

$$\sum_{i=1}^n T(X_i) - nA'(\theta).$$

The likelihood equation sets this equal to zero, which reduces to the equation
$\bar T_n = A'(\theta)$, where $\bar T_n = n^{-1}\sum_{i=1}^n T(X_i)$. Hence, the MLE is found by equating
the sample mean of the $T(X_i)$ values to its expected value. Assuming the
equation $\bar T_n = A'(\theta)$ can be solved for $\theta$, it must be the maximum likelihood
estimator. Indeed, the second derivative of the log likelihood is $-nA''(\theta) < 0$,
which also shows there can be at most one solution to the likelihood equation.
Furthermore, by the law of large numbers, $\bar T_n \xrightarrow{P} A'(\theta)$, which combined with
the fact that $A''(\theta) > 0$ yields that, with probability tending to one, there exists
exactly one solution to the likelihood equation. Thus, $\hat\theta_n$ is well-defined with
probability tending to one. To determine its limiting distribution, first note that

$$n^{1/2}[\bar T_n - A'(\theta)] \xrightarrow{d} N(0, A''(\theta)),$$

by the Central Limit Theorem. Since $A'$ is strictly increasing, we can define the
inverse function $B$ of $A'$ so that $B(A'(\theta)) = \theta$. Then, $\hat\theta_n = B(A'(\hat\theta_n)) = B(\bar T_n)$.
By the delta method,

$$n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, \tau^2),$$

where

$$\tau^2 = A''(\theta)[B'(A'(\theta))]^2.$$

But using the chain rule to differentiate both sides of the identity $B(A'(\theta)) = \theta$
yields $B'(A'(\theta))A''(\theta) = 1$, so that

$$n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N\Bigl(0,\ \frac{1}{A''(\theta)}\Bigr).$$

In fact, the asymptotic variance $[A''(\theta)]^{-1}$ is $I^{-1}(\theta)$, where $I(\theta)$ is the Fisher
Information. ■
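When $A'$ has no closed-form inverse, the likelihood equation $\bar T_n = A'(\theta)$ of Example 12.4.2 can be solved by Newton's method, since the log likelihood is concave. The sketch below is illustrative only: the function name, starting value, and stopping rule are assumptions, and the Poisson family in natural parametrization ($T(x) = x$, $A(\theta) = e^\theta$) is used merely because its solution $\log \bar T_n$ is known explicitly.

```python
import math


def expfam_mle(t_bar, a_prime, a_double_prime, theta0=0.0, tol=1e-12, max_iter=100):
    """Solve t_bar = A'(theta) by Newton's method on the concave log likelihood."""
    theta = theta0
    for _ in range(max_iter):
        # Newton step: score / information, with score = t_bar - A'(theta) per observation.
        step = (t_bar - a_prime(theta)) / a_double_prime(theta)
        theta += step
        if abs(step) < tol:
            break
    return theta


# Poisson in natural form: A(theta) = exp(theta), so A' = A'' = exp.
theta_hat = expfam_mle(3.0, math.exp, math.exp)
asymptotic_variance = 1.0 / math.exp(theta_hat)  # 1 / A''(theta_hat)
print(theta_hat, math.log(3.0), asymptotic_variance)
```

The last quantity estimates the asymptotic variance $[A''(\theta)]^{-1}$ of $n^{1/2}(\hat\theta_n - \theta)$ derived above.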

Problem 12.37 generalizes the previous example to multiparameter exponential 
families. 

The general theory of asymptotic normality of the MLE is much more difficult
and we shall here only give a heuristic treatment. For precise conditions and
rigorous proofs, see Lehmann and Casella (1998), Chapter 6 and Ibragimov and
Has'minskii (1981), Section 3.3. Let $X_1, \ldots, X_n$ be i.i.d. according to a family
$\{P_\theta\}$ which is q.m.d. at $\theta_0$ with nonsingular Fisher Information matrix $I(\theta_0)$ and
quadratic mean derivative $\eta(\cdot, \theta_0)$. Define

$$L_{n,h} = \frac{L_n(\theta_0 + hn^{-1/2})}{L_n(\theta_0)}. \tag{12.57}$$

By Theorem 12.2.3,

$$\log(L_{n,h}) = \langle h, Z_n\rangle - \frac{1}{2}\langle h, I(\theta_0)h\rangle + o_{P^n_{\theta_0}}(1), \tag{12.58}$$

where $Z_n$ is the normalized score vector

$$Z_n = Z_n(\theta_0) = 2n^{-1/2}\sum_{i=1}^n \eta(X_i, \theta_0)/p_{\theta_0}^{1/2}(X_i) \tag{12.59}$$

and satisfies, under $\theta_0$,

$$Z_n \xrightarrow{d} N(0, I(\theta_0)).$$

Note that $Z_n = Z_n(\theta_0)$ depends on $\theta_0$, but we will usually omit this dependence
in the notation.

If the MLE $\hat\theta_n$ is well-defined, then $\hat\theta_n = \theta_0 + \hat h_n n^{-1/2}$, where $\hat h_n$ is the value
of $h$ maximizing $L_{n,h}$. The result (12.58) suggests that, if $\theta_0$ is the true value, $\hat h_n$
is approximately equal to $\tilde h_n$, which maximizes

$$\log(\tilde L_{n,h}) = \langle h, Z_n\rangle - \frac{1}{2}\langle h, I(\theta_0)h\rangle. \tag{12.60}$$

Since $\log(\tilde L_{n,h})$ is a simple (quadratic) function of $h$, it is easily checked (Problem
12.44) that

$$\tilde h_n = I^{-1}(\theta_0)Z_n. \tag{12.61}$$

It then follows that

$$n^{1/2}(\hat\theta_n - \theta_0) = \hat h_n \approx \tilde h_n = I^{-1}(\theta_0)Z_n \xrightarrow{d} N(0, I^{-1}(\theta_0)).$$

The symbol $\approx$ is used to indicate an approximation based on heuristic considerations.
Unfortunately, the above approximation is not rigorous without further
conditions. In fact, without further conditions, the maximum likelihood estimator
may not even be consistent. Indeed, an example of Le Cam (presented in Example
4.1 of Chapter 6 in Lehmann and Casella (1998)) shows that the maximum
likelihood estimator $\hat\theta_n$ may exist and be unique but does not converge to the
true value $\theta$ in probability (i.e., it is inconsistent). Moreover, the example shows
this can happen even in very smooth families in which good estimators do exist.
Rigorous conditions for the MLE to be consistent were given by Wald (1949),
and have since then been weakened (for a survey, see Perlman (1972)). Cramér
(1946) derived good asymptotic behavior of the maximum likelihood estimator
under just certain smoothness conditions, often known as Cramér type conditions.
Furthermore, he gave conditions under which there exists a consistent sequence
of roots $\hat\theta_n$ of the likelihood equations (not necessarily the MLE) satisfying

$$n^{1/2}(\hat\theta_n - \theta_0) = I^{-1}(\theta_0)Z_n + o_{P^n_{\theta_0}}(1), \tag{12.62}$$

from which asymptotic normality follows. Cramér's conditions required that the
underlying family of densities be three times differentiable with respect to $\theta$, as
well as further technical assumptions on differentiability inside the integral signs;
see Chapter 6 of Lehmann and Casella (1998). Estimators satisfying (12.62) are
called efficient. In the case where $\hat\theta_n$ is a solution to the likelihood equations, it
is called an efficient likelihood estimator (ELE) sequence.

Determination of an efficient sequence of roots of the likelihood equations tends
to be difficult when the equations have multiple roots. Asymptotically equivalent
estimators can be constructed by starting with any estimator $\tilde\theta_n$ that is $n^{1/2}$-consistent,
i.e. for which $n^{1/2}(\tilde\theta_n - \theta)$ is bounded in probability. The resulting
estimator can be taken to be the root closest to $\tilde\theta_n$, or an approximation to it
based on a Newton-Raphson linearization method; for more details, see Section
6.4 of Lehmann and Casella (1998), Gan and Jiang (1999) and Small, Wang
and Yang (2000). A similar, but distinct, approach based on discretization of an
initial estimator, leads to Le Cam's (1956, 1969) one-step maximum likelihood
estimator, which satisfies (12.62) under fairly weak conditions.
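To make the Newton-Raphson linearization concrete, the sketch below performs a single scoring step $\tilde\theta_n + I^{-1}(\tilde\theta_n)\,n^{-1}\sum_i \tilde\eta(X_i, \tilde\theta_n)$ from a preliminary estimator. It is an illustration under stated assumptions, not a construction taken from the text: the model is the Cauchy location family (score $2u/(1 + u^2)$ with $u = x - \theta$, Fisher information $1/2$), the starting point is the sample median, and no discretization of the initial estimator is carried out, so this is the plain linearized step rather than Le Cam's version.

```python
import numpy as np

CAUCHY_INFO = 0.5  # Fisher information of the Cauchy location family


def cauchy_score(x, theta):
    """Score of the Cauchy location model: d/dtheta log f(x - theta)."""
    u = x - theta
    return 2.0 * u / (1.0 + u**2)


def one_step_estimate(x, theta_init):
    """One scoring step: theta_init + I^{-1} * (average score at theta_init)."""
    x = np.asarray(x, dtype=float)
    return theta_init + np.mean(cauchy_score(x, theta_init)) / CAUCHY_INFO


# Simulated data with true location 1.0 (seed and sample size are arbitrary choices).
rng = np.random.default_rng(1)
sample = rng.standard_cauchy(2000) + 1.0
theta_tilde = np.median(sample)              # root-n consistent starting value
theta_one_step = one_step_estimate(sample, theta_tilde)
print(theta_tilde, theta_one_step)
```

The Cauchy family is a natural test case because its likelihood equation can have several roots, which is exactly the situation in which a one-step correction from a consistent starting value is attractive.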

If $\hat\theta_n$ is any estimator sequence (not necessarily the MLE or an ELE) which
satisfies (12.62), it follows that, under $\theta_0$,

$$n^{1/2}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I^{-1}(\theta_0)).$$

For the remainder of this section, we will assume such an estimator sequence $\hat\theta_n$
is available, by means of verification of Cramér type assumptions presented in
Lehmann and Casella (1998), or by direct verification as in the case of exponential
families of Example 12.4.2 and Problem 12.37. For testing applications, it is also
important to study the behavior of the estimator under contiguous alternatives.
The following theorem assumes the expansion (12.62) (which is only assumed to
hold under $\theta_0$) in order to derive the limiting behavior of $\hat\theta_n$ under contiguous
sequences $\theta_n$.

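Theorem 12.4.1 Assume $X_1, \ldots, X_n$ are i.i.d. according to a q.m.d. model
$\{P_\theta, \theta \in \Omega\}$ with nonsingular Information matrix $I(\theta)$, $\theta \in \Omega$, an open subset
of $\mathbb{R}^k$. Suppose an estimator $\hat\theta_n$ has the expansion (12.62) when $\theta = \theta_0$. Let
$\theta_n = \theta_0 + h_n n^{-1/2}$, where $h_n \to h \in \mathbb{R}^k$. Then, under $P^n_{\theta_n}$,

$$n^{1/2}(\hat\theta_n - \theta_n) \xrightarrow{d} N(0, I^{-1}(\theta_0)); \tag{12.63}$$

equivalently, under $P^n_{\theta_n}$,

$$n^{1/2}(\hat\theta_n - \theta_0) \xrightarrow{d} N(h, I^{-1}(\theta_0)). \tag{12.64}$$

Furthermore, if $g(\theta)$ is a differentiable map from $\Omega$ to $\mathbb{R}$ with nonzero gradient
$\dot g(\theta)$ of dimension $1 \times k$, then under $P^n_{\theta_n}$,

$$n^{1/2}(g(\hat\theta_n) - g(\theta_n)) \xrightarrow{d} N(0, \sigma^2_{\theta_0}), \tag{12.65}$$

where

$$\sigma^2_{\theta_0} = \dot g(\theta_0)I^{-1}(\theta_0)\dot g(\theta_0)^T. \tag{12.66}$$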

Proof. We prove the result in the case $h_n = h$, the more general case deferred
to Problem 13.13. We will first show (12.64). By the Cramér-Wold device, it is
enough to show that, for any $t \in \mathbb{R}^k$, under $P^n_{\theta_n}$,

$$\langle n^{1/2}(\hat\theta_n - \theta_0), t\rangle \xrightarrow{d} N(\langle h, t\rangle, \langle t, I^{-1}(\theta_0)t\rangle).$$

By the assumption (12.62), we only need to show that, under $P^n_{\theta_n}$,

$$\langle I^{-1}(\theta_0)Z_n, t\rangle \xrightarrow{d} N(\langle h, t\rangle, \langle t, I^{-1}(\theta_0)t\rangle).$$

By Example 12.3.7, $P^n_{\theta_n}$ is contiguous to $P^n_{\theta_0}$, so we can apply Corollary 12.3.2
with $T_n = \langle I^{-1}(\theta_0)Z_n, t\rangle$. Then,

$$(T_n, \log(L_{n,h})) = \bigl(\langle I^{-1}(\theta_0)Z_n, t\rangle,\ \langle h, Z_n\rangle - \tfrac{1}{2}\langle h, I(\theta_0)h\rangle\bigr) + o_{P^n_{\theta_0}}(1).$$

But, under $\theta_0$, $Z_n$ converges in law to $Z$, where $Z$ is distributed as $N(0, I(\theta_0))$.
By Slutsky's Theorem and the Continuous Mapping Theorem (or the bivariate
Central Limit Theorem), under $\theta_0$,

$$(T_n, \log(L_{n,h})) \xrightarrow{d} \bigl(\langle I^{-1}(\theta_0)Z, t\rangle,\ \langle h, Z\rangle - \tfrac{1}{2}\langle h, I(\theta_0)h\rangle\bigr).$$

This limiting distribution is bivariate normal with covariance

$$\sigma_{1,2} = Cov(\langle I^{-1}(\theta_0)Z, t\rangle, \langle h, Z\rangle) = E[(h^T Z)(I^{-1}(\theta_0)Z)^T t]$$

$$= h^T E(ZZ^T)I^{-1}(\theta_0)t = h^T I(\theta_0)I^{-1}(\theta_0)t = \langle h, t\rangle.$$

The result (12.64) follows from Corollary 12.3.2. The assertion (12.65) follows
from (12.63) and the delta method. ■

Under the conditions of the previous theorem, the estimator sequence $g(\hat\theta_n)$
possesses a weak robustness property in the sense that its limiting distribution
is unchanged by small perturbations of the parameter values. In the literature,
such estimator sequences are sometimes called regular.

Corollary 12.4.1 Assume $X_1, \ldots, X_n$ are i.i.d. according to a q.m.d. model
$\{P_\theta, \theta \in \Omega\}$ with normalized score vector $Z_n$ given by (12.59), nonsingular
Information matrix $I(\theta)$, $\theta \in \Omega$, an open subset of $\mathbb{R}^k$. Let $\theta_n = \theta_0 + h_n n^{-1/2}$,
where $h_n \to h \in \mathbb{R}^k$. Then, under $P^n_{\theta_n}$,

$$Z_n \xrightarrow{d} N(I(\theta_0)h, I(\theta_0)). \tag{12.67}$$

The proof is left as an exercise (Problem 12.38).


12.4.2 Wald Tests and Confidence Regions

Wald proposed tests and confidence regions based on the asymptotic distribution
of the maximum likelihood estimator. In this section, we introduce these
methods and study their large sample behavior; some optimality properties will
be discussed in Sections 13.3 and 13.4. We assume $\hat\theta_n$ is any estimator satisfying
(12.62). Let $g(\theta)$ be a mapping from $\Omega$ to the real line, assumed differentiable
with nonzero gradient vector $\dot g(\theta)$ of dimension $1 \times k$. Suppose the problem is to
test the null hypothesis $g(\theta) = 0$ versus the alternative $g(\theta) > 0$. Let $\theta_0$ denote
the true value of $\theta$. Under the assumptions of Theorem 12.4.1, under $\theta_0$,

$$n^{1/2}[g(\hat\theta_n) - g(\theta_0)] \xrightarrow{d} N(0, \sigma^2_{\theta_0}),$$

where

$$\sigma^2_{\theta_0} = \dot g(\theta_0)I^{-1}(\theta_0)\dot g(\theta_0)^T.$$

Assuming that $\dot g(\cdot)$ and $I(\cdot)$ are continuous, the asymptotic variance can be
consistently estimated by

$$\hat\sigma_n^2 = \dot g(\hat\theta_n)I^{-1}(\hat\theta_n)\dot g(\hat\theta_n)^T.$$

Hence, the test that rejects when

$$n^{1/2}g(\hat\theta_n) > \hat\sigma_n z_{1-\alpha}$$

is pointwise asymptotically level $\alpha$.

We can also calculate the limiting power against a sequence of alternatives
$\theta_n = \theta_0 + hn^{-1/2}$. Assume $g(\theta_0) = 0$. Then,

$$P_{\theta_n}\{n^{1/2}g(\hat\theta_n) > \hat\sigma_n z_{1-\alpha}\} = P_{\theta_n}\{n^{1/2}[g(\hat\theta_n) - g(\theta_n)] > \hat\sigma_n z_{1-\alpha} - n^{1/2}g(\theta_n)\}.$$

By Theorem 12.4.1, $n^{1/2}[g(\hat\theta_n) - g(\theta_n)]$ is asymptotically $N(0, \sigma^2_{\theta_0})$, under $\theta_n$.
Also, $\hat\sigma_n \to \sigma_{\theta_0}$ in probability under $\theta_n$ (since this convergence holds under $\theta_0$
and therefore under $\theta_n$ by contiguity). Finally, $n^{1/2}g(\theta_n) \to \dot g(\theta_0)h$. Hence, the
limiting power is

$$\lim_{n \to \infty} P_{\theta_n}\{n^{1/2}g(\hat\theta_n) > \hat\sigma_n z_{1-\alpha}\} = 1 - \Phi(z_{1-\alpha} - \sigma_{\theta_0}^{-1}\dot g(\theta_0)h). \tag{12.68}$$

Similarly, a pointwise asymptotically level $1 - \alpha$ confidence interval for
$g(\theta)$ is given by

$$g(\hat\theta_n) \pm z_{1-\frac{\alpha}{2}}\,n^{-1/2}\hat\sigma_n.$$
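The one-sided Wald test, the interval $g(\hat\theta_n) \pm z_{1-\alpha/2}\,n^{-1/2}\hat\sigma_n$, and the limiting power (12.68) require only the gradient of $g$ and the inverse information. The sketch below uses hypothetical function names and, as its usage example, the assumption $g(\mu, \sigma^2) = \mu$ in the normal family, for which $\dot g = (1, 0)$ and $I^{-1} = \mathrm{diag}(\sigma^2, 2\sigma^4)$; the numerical inputs are made up for illustration.

```python
import numpy as np
from statistics import NormalDist


def wald_scalar(n, g_hat, grad, info_inv, alpha=0.05):
    """One-sided Wald test of g(theta) = 0 and two-sided level 1 - alpha interval for g(theta)."""
    grad = np.asarray(grad, dtype=float)
    sigma_hat = float(np.sqrt(grad @ np.asarray(info_inv, dtype=float) @ grad))
    std = NormalDist()
    reject = bool(np.sqrt(n) * g_hat > sigma_hat * std.inv_cdf(1 - alpha))
    half_width = std.inv_cdf(1 - alpha / 2) * sigma_hat / np.sqrt(n)
    return reject, (g_hat - half_width, g_hat + half_width)


def wald_limiting_power(grad0, info_inv0, h, alpha=0.05):
    """Limiting power 1 - Phi(z_{1-alpha} - grad0 . h / sigma_{theta_0}) from (12.68)."""
    grad0 = np.asarray(grad0, dtype=float)
    h = np.asarray(h, dtype=float)
    sigma0 = float(np.sqrt(grad0 @ np.asarray(info_inv0, dtype=float) @ grad0))
    std = NormalDist()
    return 1 - std.cdf(std.inv_cdf(1 - alpha) - float(grad0 @ h) / sigma0)


# Usage with g(mu, sigma^2) = mu, estimated variance 1, n = 100, sample mean 0.3.
reject, interval = wald_scalar(100, 0.3, [1.0, 0.0], np.diag([1.0, 2.0]))
print(reject, interval)
print(wald_limiting_power([1.0, 0.0], np.diag([1.0, 2.0]), [1.0, 0.0]))
```

Any other smooth $g$ can be handled by supplying its gradient evaluated at $\hat\theta_n$ (for the test and interval) or at $\theta_0$ (for the limiting power).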


Example 12.4.3 (Normal Coefficient of Variation) Let $X_1, \ldots, X_n$ be i.i.d.
$N(\mu, \sigma^2)$ with both parameters unknown, as in Example 12.4.1. Consider inferences
for $g((\mu, \sigma^2)^T) = \mu/\sigma$, the coefficient of variation. Recall that a uniformly
most accurate invariant one-sided confidence bound exists for $\mu/\sigma$; however, it is
quite complicated to compute since it involves the noncentral t-distribution and
no explicit formula is available. However, a normal approximation leads to an
interval that is asymptotically valid. Note that

$$\dot g((\mu, \sigma^2)^T) = \Bigl(\frac{1}{\sigma},\ -\frac{\mu}{2\sigma^3}\Bigr).$$

By Example 12.4.1, $n^{1/2}[(\bar X_n, S_n^2)^T - (\mu, \sigma^2)^T]$ is asymptotically bivariate normal
with asymptotic covariance matrix $\Sigma$, where $\Sigma$ is the diagonal matrix with (1,1)
entry $\sigma^2$ and (2,2) entry $2\sigma^4$. Then, the delta method implies that

$$n^{1/2}\Bigl(\frac{\bar X_n}{S_n} - \frac{\mu}{\sigma}\Bigr) \xrightarrow{d} N\Bigl(0,\ 1 + \frac{\mu^2}{2\sigma^2}\Bigr).$$

Thus, the interval

$$\frac{\bar X_n}{S_n} \pm n^{-1/2}\Bigl(1 + \frac{\bar X_n^2}{2S_n^2}\Bigr)^{1/2} z_{1-\frac{\alpha}{2}}$$

is asymptotically pointwise level $1 - \alpha$. ■

Consider now the general problem of constructing a confidence region for $\theta$,
under the assumptions of Theorem 12.4.1. The convergence

$$n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, I^{-1}(\theta)) \tag{12.69}$$

implies that

$$I^{1/2}(\theta)\,n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, I_k),$$

the multivariate normal distribution in $\mathbb{R}^k$ with mean 0 and identity covariance
matrix $I_k$. Hence, by the Continuous Mapping Theorem 11.2.13 and Example
11.2.8,

$$n(\hat\theta_n - \theta)^T I(\theta)(\hat\theta_n - \theta) \xrightarrow{d} \chi^2_k,$$

the Chi-squared distribution with $k$ degrees of freedom. Thus, a pointwise
asymptotic level $1 - \alpha$ confidence region for $\theta$ is

$$\{\theta : n(\hat\theta_n - \theta)^T I(\theta)(\hat\theta_n - \theta) \le c_{k,1-\alpha}\}, \tag{12.70}$$

where $c_{k,1-\alpha}$ is the $1 - \alpha$ quantile of $\chi^2_k$. In (12.70), $I(\theta)$ is often replaced by a
consistent estimator, such as $I(\hat\theta_n)$ (assuming $I(\cdot)$ is continuous), and the resulting
confidence region is known as Wald's confidence ellipsoid.

By the duality between confidence regions and tests, this leads to an asymptotic
level $\alpha$ test of $\theta = \theta_0$ versus $\theta \ne \theta_0$, known as the Wald test. Specifically, for testing
$\theta = \theta_0$ versus $\theta \ne \theta_0$, Wald's test rejects if

$$n(\hat\theta_n - \theta_0)^T I(\hat\theta_n)(\hat\theta_n - \theta_0) > c_{k,1-\alpha}. \tag{12.71}$$

Alternatively, $I(\hat\theta_n)$ may be replaced by $I(\theta_0)$ or any consistent estimator of
$I(\theta_0)$. Under $\theta_n = \theta_0 + hn^{-1/2}$, the limiting distribution of the Wald statistic
given by the left side of (12.71) is $\chi^2_k(|I^{1/2}(\theta_0)h|^2)$, the noncentral Chi-squared
distribution with $k$ degrees of freedom and noncentrality parameter $|I^{1/2}(\theta_0)h|^2$
(Problem 12.45).
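As an illustration of (12.71), the Wald statistic can be computed in the normal family with $\theta = (\mu, \sigma^2)^T$, where $I(\theta) = \mathrm{diag}(1/\sigma^2, 1/(2\sigma^4))$ is the inverse of the matrix $\Sigma$ of Example 12.4.1. The sketch below uses hypothetical function names. Since $k = 2$ here, the critical value is available in closed form, because the $\chi^2_2$ c.d.f. is $1 - e^{-x/2}$; for other $k$ a numerical quantile routine would be needed.

```python
import math
import numpy as np


def normal_information(mu, sigma2):
    """Fisher information for (mu, sigma^2) in the N(mu, sigma^2) family (mu does not enter)."""
    return np.diag([1.0 / sigma2, 1.0 / (2.0 * sigma2**2)])


def wald_statistic(x, theta0):
    """n (theta_hat - theta0)^T I(theta_hat) (theta_hat - theta0), with theta_hat the MLE."""
    x = np.asarray(x, dtype=float)
    theta_hat = np.array([x.mean(), x.var()])  # np.var uses the biased (divide-by-n) form
    d = theta_hat - np.asarray(theta0, dtype=float)
    return float(x.size * d @ normal_information(*theta_hat) @ d)


def chi2_quantile_2df(level):
    """Quantile of the chi-squared distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - level)


data = [-1.0, 1.0, -1.0, 1.0]
stat = wald_statistic(data, theta0=(1.0, 1.0))
print(stat, chi2_quantile_2df(0.95), stat > chi2_quantile_2df(0.95))
```

The set of $\theta_0$ values for which the statistic does not exceed the critical value is Wald's confidence ellipsoid with $I(\theta)$ replaced by $I(\hat\theta_n)$.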

More generally, consider inference for $g(\theta)$, where $g = (g_1, \ldots, g_q)^T$ is a mapping
from $\mathbb{R}^k$ to $\mathbb{R}^q$. Assume $g_i$ is differentiable and let $D = D(\theta)$ denote the
$q \times k$ matrix with $(i, j)$ entry $\partial g_i(y_1, \ldots, y_k)/\partial y_j$ evaluated at $\theta$. Then, the Delta
Method and (12.69) imply that

$$n^{1/2}[g(\hat\theta_n) - g(\theta)] \xrightarrow{d} N(0, V(\theta)), \tag{12.72}$$

where $V(\theta) = D(\theta)I^{-1}(\theta)D^T(\theta)$. Assume $V(\theta)$ is positive definite and continuous
in $\theta$. By the Continuous Mapping Theorem,

$$n[g(\hat\theta_n) - g(\theta)]^T V^{-1}(\theta)[g(\hat\theta_n) - g(\theta)] \xrightarrow{d} \chi^2_q.$$

Hence, a pointwise asymptotically level $1 - \alpha$ confidence region for $g(\theta)$ is

$$\{\theta : n[g(\hat\theta_n) - g(\theta)]^T V^{-1}(\hat\theta_n)[g(\hat\theta_n) - g(\theta)] \le \chi^2_q(1 - \alpha)\}.$$

Next, suppose it is desired to test $g(\theta) = 0$. The Wald test rejects when

$$W_n = n\,g^T(\hat\theta_n)V^{-1}(\hat\theta_n)g(\hat\theta_n)$$

exceeds $\chi^2_q(1 - \alpha)$, and it is pointwise asymptotically level $\alpha$.

12.4.3 Rao Score Tests

Instead of the Wald tests, it is possible to construct tests based directly on Z n in 
(12.59), which have the advantage of not requiring computation of a maximum 
likelihood estimator. Assume q.m.d. holds at do, with derivative rj{-,do) and, as 
usual, set 

V(x,0 0 ) = 2ri{x,e 0 )/p 1 g / 0 2 (x) . 

Under the assumptions of Theorem 12.2.2, the quadratic mean derivative r;(-, do) 
is given by (12.9) and n 1/l2 Z n can then be computed by 

n 

n 1/2 z n = Y,v(Xi-e 0 ) = 

i =1 

(12.73) 

As mentioned earlier, the statistic Z n is known as the normalized score vector. 
Its use stems from the fact that inference can be based on Z n , which involves 
differentiating the log likelihood at a single point do, avoiding the problem of max¬ 
imizing the likelihood. Even if the ordinary differentiability conditions assumed 
in Theorem 12.2.2 fail, inference can be based on Z„, as we will now see. 

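Suppose for the moment that $\theta$ is real-valued and consider testing $\theta = \theta_0$ versus
$\theta > \theta_0$. For a given test $\phi = \phi(X_1, \ldots, X_n)$, let

$$\beta_\phi(\theta) = E_\theta[\phi(X_1, \ldots, X_n)]$$

denote its power function. By Problem 12.17, assuming q.m.d., $\beta_\phi(\theta)$ is
differentiable at $\theta_0$ with

$$\beta_\phi'(\theta_0) = \int \cdots \int \phi(x_1, \ldots, x_n) \sum_{i=1}^n \tilde\eta(x_i, \theta_0) \prod_{i=1}^n p_{\theta_0}(x_i)\, \mu(dx_1) \cdots \mu(dx_n) .$$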

Consider the problem of finding the level $\alpha$ test $\phi$ that maximizes $\beta_\phi'(\theta_0)$. By the
general form of the Neyman-Pearson Lemma, the optimal test rejects for large
values of $\sum_i \tilde\eta(X_i, \theta_0)$, or equivalently, large values of $Z_n$. By Problem 8.2, if this
is the unique test maximizing the slope of the power function at $\theta_0$, then it is
also locally most powerful. Thus, tests based on $Z_n$ are appealing from this point
of view.

We turn now to the asymptotic behavior of tests based on $Z_n$. Assume the
assumptions of quadratic mean differentiability hold for general $k$, so that under
$\theta_0$,

$$Z_n \xrightarrow{d} N(0, I(\theta_0)) .$$

By Corollary 12.4.1, under $\theta_n = \theta_0 + hn^{-1/2}$,

$$Z_n \xrightarrow{d} N(I(\theta_0)h, I(\theta_0)) .$$

It follows that, under $\theta_n = \theta_0 + hn^{-1/2}$,

$$I^{-1/2}(\theta_0)Z_n \xrightarrow{d} N(I^{1/2}(\theta_0)h, I_k) . \qquad (12.74)$$

Now, suppose $k = 1$ and the problem is to test $\theta = \theta_0$ versus $\theta > \theta_0$. Rao's
score test rejects when the one-sided score statistic $I^{-1/2}(\theta_0)Z_n$ exceeds $z_{1-\alpha}$
and is asymptotically level $\alpha$. In this case, the Wald test that rejects when
$I^{1/2}(\theta_0)n^{1/2}(\hat\theta_n - \theta_0)$ exceeds $z_{1-\alpha}$ and the score test are asymptotically
equivalent, in the sense that the probability that the two tests yield the same decision
tends to one, both under the null hypothesis $\theta = \theta_0$ and under a sequence of
alternatives $\theta_0 + hn^{-1/2}$. The equivalence follows from contiguity, the expansion
(12.62), and the fact that $I(\hat\theta_n) \to I(\theta_0)$ in probability under $\theta_0$ and under
$\theta_0 + hn^{-1/2}$. Note that the two tests may differ greatly for alternatives far from
$\theta_0$; see Example 13.3.3.
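To make the two one-sided statistics concrete, here is a small Python sketch under an assumed model that is not from the text: exponential observations with rate $\theta$, so that $p_\theta(x) = \theta e^{-\theta x}$, the score is $1/\theta - x$, $I(\theta) = 1/\theta^2$, and the MLE is $1/\bar X_n$. The function name and the data are hypothetical.

```python
import math
from statistics import NormalDist


def one_sided_score_and_wald(xs, theta0):
    """Return (score statistic, Wald statistic) for theta = theta0 vs theta > theta0.

    Assumed model (not from the text): p_theta(x) = theta * exp(-theta * x), x > 0.
    """
    n = len(xs)
    xbar = sum(xs) / n
    info0 = 1.0 / theta0 ** 2  # Fisher information I(theta0)
    # Normalized score Z_n = n^{-1/2} * sum of d/dtheta log p_theta(X_i) at theta0.
    z_n = sum(1.0 / theta0 - x for x in xs) / math.sqrt(n)
    score = z_n / math.sqrt(info0)  # I^{-1/2}(theta0) Z_n
    theta_hat = 1.0 / xbar  # MLE of the rate
    wald = math.sqrt(info0) * math.sqrt(n) * (theta_hat - theta0)
    return score, wald


z_crit = NormalDist().inv_cdf(0.95)  # z_{1-alpha} with alpha = 0.05
score, wald = one_sided_score_and_wald([0.5, 0.5, 0.5, 0.5], theta0=1.0)
print(score, wald, z_crit)
```

With this deliberately tiny sample the two statistics fall on opposite sides of $z_{1-\alpha}$. That does not contradict the text, since the equivalence is asymptotic and concerns alternatives at distance of order $n^{-1/2}$ from $\theta_0$.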


Example 12.4.4 (Bivariate Normal Correlation) Assume $X_i = (U_i, V_i)$
are i.i.d. according to the bivariate normal distribution with means zero and
variances one, so that the only unknown parameter is $\rho$, the correlation. In this
case,

$$\log L_n(\rho) = -n\log(2\pi) - \frac{n}{2}\log(1 - \rho^2) - \frac{1}{2(1 - \rho^2)}\sum_{i=1}^n (U_i^2 - 2\rho U_iV_i + V_i^2)$$

and so

$$\frac{\partial}{\partial\rho}\log L_n(\rho) = \frac{n\rho}{1 - \rho^2} + \frac{1}{1 - \rho^2}\sum_{i=1}^n U_iV_i - \frac{\rho}{(1 - \rho^2)^2}\sum_{i=1}^n (U_i^2 - 2\rho U_iV_i + V_i^2) .$$

In the special case $\theta_0 = \rho_0 = 0$,

$$Z_n = n^{-1/2}\sum_{i=1}^n U_iV_i \xrightarrow{d} N(0, 1) .$$

For other values of $\rho_0$, the statistic is more complicated; however, we have bypassed
maximizing the likelihood, which may have multiple roots in this example.
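The special case $\rho_0 = 0$ is easy to try numerically. The sketch below uses only the Python standard library; the function names, the seed, the sample size and the alternative $\rho = 0.3$ are all illustrative assumptions. Since $I(0) = 1$ in this model, the one-sided score test compares $Z_n$ directly with $z_{1-\alpha}$.

```python
import math
import random
from statistics import NormalDist


def score_statistic_rho_zero(pairs):
    """Normalized score Z_n = n^{-1/2} * sum(U_i * V_i) for testing rho = 0."""
    n = len(pairs)
    return sum(u * v for u, v in pairs) / math.sqrt(n)


def simulate_pairs(n, rho, rng):
    """Draw n bivariate normal pairs: zero means, unit variances, correlation rho."""
    pairs = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        w = rng.gauss(0.0, 1.0)
        v = rho * u + math.sqrt(1.0 - rho * rho) * w
        pairs.append((u, v))
    return pairs


rng = random.Random(0)  # fixed seed, purely for reproducibility
alpha = 0.05
critical = NormalDist().inv_cdf(1.0 - alpha)  # z_{1-alpha}
z_n = score_statistic_rho_zero(simulate_pairs(500, 0.3, rng))
reject = z_n > critical
print(f"Z_n = {z_n:.3f}, critical value = {critical:.3f}, reject: {reject}")
```

No root finding or likelihood maximization is needed, which is the point made in the example.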


For general $k$, consider testing a simple null hypothesis $\theta = \theta_0$ versus a multi-sided
alternative $\theta \ne \theta_0$. Then, assuming the expansion (12.62), we can replace
$n^{1/2}(\hat\theta_n - \theta_0)$ in the Wald statistic (12.70) by $I^{-1}(\theta_0)Z_n$. In this case, the score
test rejects the null hypothesis when the multi-sided score statistic $Z_n^T I^{-1}(\theta_0)Z_n$
exceeds $c_{k,1-\alpha}$, and is asymptotically level $\alpha$. Again, the Wald test and Rao's
score test are asymptotically equivalent in the sense described above.


Next, we consider a composite null hypothesis. Interest focuses on $\theta_1, \ldots, \theta_r$,
the first $r$ components of $\theta$ with the remaining $k - r$ components viewed as
nuisance parameters. Let $\theta_{1,0}, \ldots, \theta_{r,0}$ be fixed and consider testing the null
hypothesis $\theta_i = \theta_{i,0}$ for $i = 1, \ldots, r$. The Wald test is based on the limit

$$n^{1/2}(\hat\theta_{n,1} - \theta_1, \ldots, \hat\theta_{n,r} - \theta_r) \xrightarrow{d} N(0, \Sigma^{(r)}(\theta)) ,$$

where $\Sigma(\theta) = I^{-1}(\theta)$ and $\Sigma^{(r)}(\theta)$ is the $r \times r$ matrix formed by the intersection of
the first $r$ rows and columns of $\Sigma(\theta)$. Similarly, define $I^{(r)}(\theta)$ as the $r \times r$ matrix
formed by the intersection of the first $r$ rows and columns of $I(\theta)$. Partition $I(\theta)$
as

$$I(\theta) = \begin{pmatrix} I^{(r)}(\theta) & I_{12}(\theta) \\ I_{21}(\theta) & I_{22}(\theta) \end{pmatrix} . \qquad (12.75)$$
Note that (Problem 12.49)

$$[\Sigma^{(r)}(\theta)]^{-1} = I^{(r)}(\theta) - I_{12}(\theta)I_{22}^{-1}(\theta)I_{21}(\theta) . \qquad (12.76)$$
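Identity (12.76) can be checked numerically in the smallest case $k = 2$, $r = 1$, where every block is a scalar. The following Python sketch uses a made-up information matrix; the numbers and the helper name are hypothetical.

```python
def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical positive definite Fisher information with k = 2 and r = 1.
info = [[2.0, 0.6], [0.6, 1.0]]
sigma = inverse_2x2(info)  # Sigma(theta) = I^{-1}(theta)

lhs = 1.0 / sigma[0][0]  # [Sigma^{(r)}]^{-1}
rhs = info[0][0] - info[0][1] * info[1][0] / info[1][1]  # I^{(r)} - I12 I22^{-1} I21
print(lhs, rhs, info[0][0])
```

Both sides agree, and both are smaller than $I^{(r)}(\theta)$ itself, which is the inequality taken up in Problem 12.49.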

The score test is based on $Z_n^{(r)}(\theta)$, the $r$-vector obtained as the first $r$ components
of $Z_n(\theta)$, where $Z_n(\theta)$ is defined in (12.59). Under q.m.d. at $\theta$,

$$Z_n^{(r)}(\theta) \xrightarrow{d} N(0, I^{(r)}(\theta)) ,$$

and so,

$$S_n(\theta) = [Z_n^{(r)}(\theta)]^T [I^{(r)}(\theta)]^{-1} Z_n^{(r)}(\theta) \xrightarrow{d} \chi^2_r .$$

However, when the null hypothesis is not completely specified, the Rao score test
statistic is $S_n(\hat\theta_{n,0})$, where

$$\hat\theta_{n,0} = (\theta_{1,0}, \ldots, \theta_{r,0}, \hat\theta_{r+1,0}, \ldots, \hat\theta_{k,0})$$

is an efficient likelihood estimator of $\theta$ under the restricted parameter space
satisfying the constraints of the null hypothesis. In fact, as argued by Hall and
Mathiason (1990), any $n^{1/2}$-consistent estimator can be used in the score statistic.
One-sided score tests are studied in Silvapulle and Silvapulle (1995).


12.4.4 Likelihood Ratio Tests


In addition to the Wald and Rao score tests of Sections 12.4.2 and 12.4.3, let
us now consider a third test of $\theta \in \Omega_0$ versus $\theta \notin \Omega_0$, based on the likelihood ratio
statistic $2\log(R_n)$, where

$$R_n = \frac{\sup_{\theta \in \Omega} L_n(\theta)}{\sup_{\theta \in \Omega_0} L_n(\theta)} . \qquad (12.77)$$

The likelihood ratio test rejects for large values of $2\log(R_n)$. If $\hat\theta_n$ and $\hat\theta_{n,0}$ are
MLEs for $\theta$ as $\theta$ varies in $\Omega$ and $\Omega_0$ respectively, then

$$R_n = L_n(\hat\theta_n)/L_n(\hat\theta_{n,0}) . \qquad (12.78)$$


Example 12.4.5 (Multivariate Normal Mean) Suppose $X = (X_1, \ldots, X_k)^T$
is multivariate normal with unknown mean vector $\theta$ and known positive definite
covariance matrix $\Sigma$. The likelihood function is given by

$$\frac{|\Sigma|^{-1/2}}{(2\pi)^{k/2}} \exp\left[ -\frac{1}{2}(X - \theta)^T \Sigma^{-1}(X - \theta) \right] .$$

Assume $\theta \in \mathbb{R}^k$ and that the null hypothesis asserts $\theta_i = 0$ for $i = 1, \ldots, k$.
Then,

$$2\log(R_1) = -\inf_{\theta \in \mathbb{R}^k}(X - \theta)^T \Sigma^{-1}(X - \theta) + X^T\Sigma^{-1}X = X^T\Sigma^{-1}X = |\Sigma^{-1/2}X|^2 .$$

Under the null hypothesis, $\Sigma^{-1/2}X$ is exactly standard multivariate normal, and
so the null distribution of $2\log(R_1)$ is exactly $\chi^2_k$ in this case.

Now, consider testing the composite hypothesis $\theta_i = 0$ for $i = 1, \ldots, p$, with
the remaining parameters $\theta_{p+1}, \ldots, \theta_k$ regarded as nuisance parameters. More
generally, suppose

$$\Omega_0 = \{\theta = (\theta_1, \ldots, \theta_k) : A(\theta - a) = 0\} , \qquad (12.79)$$

where $A$ is a $p \times k$ matrix of rank $p$ and $a$ is some fixed $k \times 1$ vector. Then,

$$2\log(R_1) = -\inf_{\theta \in \mathbb{R}^k}(X - \theta)^T \Sigma^{-1}(X - \theta) + \inf_{\theta \in \Omega_0}(X - \theta)^T \Sigma^{-1}(X - \theta)
= \inf_{\theta \in \Omega_0}(X - \theta)^T \Sigma^{-1}(X - \theta) . \qquad (12.80)$$

The null distribution of (12.80) is $\chi^2_p$ (Problem 12.50). ■

Let us now consider the large sample behavior of the likelihood ratio test in
greater generality. First, suppose $\Omega_0 = \{\theta_0\}$ is simple. Then,

$$\log(R_n) = \sup_h [\log(L_{n,h})] ,$$

where $L_{n,h}$ is defined in (12.57). If the family is q.m.d. at $\theta_0$, then

$$\log(R_n) = \sup_h \left[ \langle h, Z_n \rangle - \frac{1}{2}\langle h, I(\theta_0)h \rangle + o_{P_{\theta_0}}(1) \right] .$$

It is then plausible that $\log(R_n)$ should behave like

$$\log(\tilde R_n) = \sup_h [\log(\tilde L_{n,h})] ,$$

where $\tilde L_{n,h}$ is defined by (12.60). But $\tilde L_{n,h}$ is maximized at $\tilde h_n = I^{-1}(\theta_0)Z_n$ and
so

$$\log(R_n) \approx \log(\tilde R_n) = \log(\tilde L_{n,\tilde h_n}) = \frac{1}{2} Z_n^T I^{-1}(\theta_0) Z_n .$$

Since $2\log(\tilde R_n) \xrightarrow{d} \chi^2_k$, the heuristics suggest that $2\log(R_n) \xrightarrow{d} \chi^2_k$ as well. In fact,
$2\log(\tilde R_n)$ is Rao's score test statistic, and so these heuristics also suggest that
Rao's score test, the likelihood ratio test, and Wald's test, are all asymptotically
equivalent in the sense described earlier in comparing the Wald test and the score
test. Note, however, that the tests are not always asymptotically equivalent; some
striking differences will be presented in Section 13.3.
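The maximization step in the heuristic can be checked by completing the square. The following is a short sketch, assuming that the log of the approximating likelihood ratio is the quadratic $\langle h, Z_n \rangle - \frac{1}{2}\langle h, I(\theta_0)h \rangle$ from the expansion above and that $I(\theta_0)$ is positive definite:

```latex
\begin{align*}
\langle h, Z_n \rangle - \tfrac{1}{2}\langle h, I(\theta_0)h \rangle
  &= \tfrac{1}{2}\, Z_n^T I^{-1}(\theta_0) Z_n
   - \tfrac{1}{2}\,\bigl(h - I^{-1}(\theta_0)Z_n\bigr)^T I(\theta_0)\,\bigl(h - I^{-1}(\theta_0)Z_n\bigr) .
\end{align*}
```

The second term is nonpositive and vanishes only at $h = I^{-1}(\theta_0)Z_n$, which gives both the maximizer and the maximum value $\frac{1}{2} Z_n^T I^{-1}(\theta_0) Z_n$.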

These heuristics can be made rigorous under stronger assumptions, such as 
Cramér type differentiability conditions used in proving asymptotic normality of
the MLE or an ELE; see Theorem 7.7.2 in Lehmann (1999). Alternatively, once 
the general heuristics point toward the limiting behavior, the approximations 
may be made rigorous by direct calculation in a particular situation. A general 
theorem based on the existence of efficient likelihood estimators will be presented 
following the next example. 


Example 12.4.6 (Multinomial Goodness of Fit) Consider a sequence of $n$
independent trials, each resulting in one of $k + 1$ outcomes $1, \ldots, k + 1$. Outcome
$j$ occurs with probability $p_j$ on any given trial. Let $Y_j$ be the number of trials
resulting in outcome $j$. Consider testing the simple null hypothesis $p_j = \pi_j$ for
$j = 1, \ldots, k + 1$. The parameter space $\Omega$ is

$$\Omega = \{(p_1, \ldots, p_k) \in \mathbb{R}^k : p_i > 0, \sum_{j=1}^k p_j < 1\} \qquad (12.81)$$

since $p_{k+1}$ is determined as $1 - \sum_{j=1}^k p_j$. In this case, the likelihood can be written
as

$$L_n(p_1, \ldots, p_k) = \frac{n!}{Y_1! \cdots Y_{k+1}!}\, p_1^{Y_1} \cdots p_{k+1}^{Y_{k+1}} .$$

By solving the likelihood equations, it is easily checked that the unique MLE is
given by $\hat p_j = Y_j/n$ (Problem 12.55 (i)). Hence, the likelihood ratio statistic is

$$R_n = \frac{L_n(Y_1/n, \ldots, Y_k/n)}{L_n(\pi_1, \ldots, \pi_k)} ,$$

and so (Problem 12.55 (ii))

$$\log(R_n) = n\sum_{j=1}^{k+1} \hat p_j \log\left(\frac{\hat p_j}{\pi_j}\right) . \qquad (12.82)$$

The previous heuristics suggest that $2\log(R_n)$ converges in distribution to $\chi^2_k$,
which will be proved in Theorem 12.4.2 below. Note that the Taylor expansion

$$f(x) = x\log(x/x_0) = (x - x_0) + \frac{1}{2x_0}(x - x_0)^2 + o[(x - x_0)^2]$$

as $x \to x_0$ implies $2\log(R_n) \approx Q_n$, where $Q_n$ is Pearson's Chi-squared statistic
given by

$$Q_n = \sum_{j=1}^{k+1} \frac{(Y_j - n\pi_j)^2}{n\pi_j} . \qquad (12.83)$$

Indeed $2\log(R_n) - Q_n \xrightarrow{P} 0$, under the null hypothesis (Problem 12.57) and so
they have the same limiting distribution. Moreover, it can be checked (Problem
12.56) that Rao's Score test statistic is exactly $Q_n$. The Chi-squared test will be
treated more fully in Section 14.3. ■
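The closeness of the likelihood ratio statistic and Pearson's statistic is easy to see numerically. The sketch below evaluates $2\log(R_n)$ from (12.82) and $Q_n$ from (12.83) for one set of counts; the function name and the counts are hypothetical, and the convention $0 \log 0 = 0$ is assumed for empty cells.

```python
import math


def lr_and_pearson(counts, null_probs):
    """Return (2 log R_n, Q_n) for the multinomial goodness-of-fit problem."""
    n = sum(counts)
    two_log_r = 0.0
    pearson = 0.0
    for y, pi in zip(counts, null_probs):
        expected = n * pi
        if y > 0:  # a cell with y = 0 contributes 0 to (12.82)
            two_log_r += 2.0 * y * math.log(y / expected)
        pearson += (y - expected) ** 2 / expected
    return two_log_r, pearson


# Hypothetical counts for k + 1 = 3 outcomes with n = 100 trials.
g_stat, q_stat = lr_and_pearson([30, 20, 50], [0.25, 0.25, 0.5])
print(f"2 log R_n = {g_stat:.4f}, Q_n = {q_stat:.4f}")
```

Both statistics would be compared with the same $\chi^2_k$ critical value, here with $k = 2$.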


Next, we present a fairly general result on the asymptotic distribution of the
likelihood ratio statistic. Actually, we consider a generalization of the likelihood
ratio statistic. Rather than having to compute the maximum likelihood estimators
$\hat\theta_n$ and $\hat\theta_{n,0}$ in (12.78), we assume these estimators satisfy (12.62) under the
models with parameter spaces $\Omega$ and $\Omega_0$, respectively.


Theorem 12.4.2 Assume $X_1, \ldots, X_n$ are i.i.d. according to q.m.d. family
$\{P_\theta, \theta \in \Omega\}$, where $\Omega$ is an open subset of $\mathbb{R}^k$ and $I(\theta)$ is positive definite.

(i) Consider testing the simple null hypothesis $\theta = \theta_0$. Suppose $\hat\theta_n$ is an efficient
estimator for $\theta$ assuming $\theta \in \Omega$ in the sense that it satisfies (12.62) when $\theta = \theta_0$.
Then, the likelihood ratio $R_n = L_n(\hat\theta_n)/L_n(\theta_0)$ satisfies, under $\theta_0$,

$$2\log(R_n) \xrightarrow{d} \chi^2_k .$$

(ii) Consider testing the composite null hypothesis $\theta \in \Omega_0$, where

$$\Omega_0 = \{\theta = (\theta_1, \ldots, \theta_k) : A(\theta - a) = 0\} , \qquad (12.84)$$

and $A$ is a $p \times k$ matrix of rank $p$ and $a$ is a fixed $k \times 1$ vector. Let $\hat\theta_{n,0}$ denote an
efficient estimator of $\theta$ assuming $\theta \in \Omega_0$; that is, assume the expansion (12.62)
holds based on the model $\{P_\theta, \theta \in \Omega_0\}$ and any $\theta \in \Omega_0$. Then, the likelihood ratio
$R_n = L_n(\hat\theta_n)/L_n(\hat\theta_{n,0})$ satisfies, under any $\theta_0 \in \Omega_0$,

$$2\log(R_n) \xrightarrow{d} \chi^2_p .$$

(iii) More generally, suppose $\Omega_0$ is represented as

$$\Omega_0 = \{\theta : g(\theta) = (g_1(\theta), \ldots, g_p(\theta))^T = 0\} ,$$

where $g_i(\theta)$ is a continuously differentiable function from $\mathbb{R}^k$ to $\mathbb{R}$. Let $D = D(\theta)$
be the $p \times k$ matrix with $(i,j)$ entry $\partial g_i(\theta)/\partial\theta_j$, assumed to have rank $p$. Then,
$2\log(R_n) \xrightarrow{d} \chi^2_p$.

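Proof. First, consider (i). Let $\hat h_n = n^{1/2}(\hat\theta_n - \theta_0)$ so that

$$2\log(R_n) = 2\log(L_{n,\hat h_n}) .$$

Fix any $c > 0$ and define

$$\epsilon_{n,c} = \sup_{|h| \le c} \left| \log(L_{n,h}) - \left[ \langle h, Z_n \rangle - \frac{1}{2}\langle h, I(\theta_0)h \rangle \right] \right| ;$$

by Remark 12.2.2, $\epsilon_{n,c} \to 0$ in probability under $\theta_0$. By the triangle inequality,

$$2\log(L_{n,\hat h_n}) \le 2\left[ \langle \hat h_n, Z_n \rangle - \frac{1}{2}\langle \hat h_n, I(\theta_0)\hat h_n \rangle + \epsilon_{n,c} \right]$$

if $|\hat h_n| \le c$. But, using (12.62),

$$2\left[ \langle \hat h_n, Z_n \rangle - \frac{1}{2}\langle \hat h_n, I(\theta_0)\hat h_n \rangle \right] = Z_n^T I^{-1}(\theta_0)Z_n + o_{P_{\theta_0}}(1) ,$$

so,

$$2\log(L_{n,\hat h_n}) \le Z_n^T I^{-1}(\theta_0)Z_n + \tilde\epsilon_{n,c}$$

if $|\hat h_n| \le c$, where $\tilde\epsilon_{n,c} \to 0$ in probability under $\theta_0$ for any $c > 0$. Therefore,

$$P\{2\log(L_{n,\hat h_n}) \ge x\} \le P\{Z_n^T I^{-1}(\theta_0)Z_n + \tilde\epsilon_{n,c} \ge x, |\hat h_n| \le c\} + P\{|\hat h_n| > c\}$$

$$\le P\{Z_n^T I^{-1}(\theta_0)Z_n + \tilde\epsilon_{n,c} \ge x\} + P\{|\hat h_n| > c\} . \qquad (12.85)$$

But, under $\theta_0$, $Z_n^T I^{-1}(\theta_0)Z_n$ is asymptotically $\chi^2_k$ and $\hat h_n \xrightarrow{d} Z$ where $Z$ is
$N(0, I^{-1}(\theta_0))$, so (12.85) tends to

$$P\{\chi^2_k \ge x\} + P\{|Z| > c\} .$$

Let $c \to \infty$ to conclude

$$\limsup_n P\{2\log(L_{n,\hat h_n}) \ge x\} \le P\{\chi^2_k \ge x\} .$$

A similar argument yields

$$\liminf_n P\{2\log(L_{n,\hat h_n}) \ge x\} \ge P\{\chi^2_k \ge x\} , \qquad (12.86)$$

and (i) is proved.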

The proof of (ii) is based on a similar argument, combined with the results 
of Example 12.4.5 for testing a composite null hypothesis about a multivariate 
normal mean vector. The proof of (iii) is left as an exercise (Problem 12.60). ■ 
In the special case where the null hypothesis is specified by $\theta_i = \theta_{i,0}$ for
$i = 1, \ldots, p$, with $\theta_i$ regarded as a nuisance parameter for $i > p$, the degrees of
freedom can be remembered as the dimension of $\Omega$ minus the dimension of $\Omega_0$.

Example 12.4.7 (One-sample Normal Mean) Suppose $X_1, \ldots, X_n$ are i.i.d.
$N(\mu, \sigma^2)$ with both parameters unknown. Consider testing $\mu = 0$ versus $\mu \ne 0$.
Then (Problem 12.46),

$$2\log(R_n) = n\log\left(1 + \frac{t_n^2}{n - 1}\right) , \qquad (12.87)$$

where $t_n^2 = n\bar X_n^2/S_n^2$ is the one-sample $t$-statistic. By Problem 11.89, one can
deduce the following Edgeworth expansion for $2\log(R_n)$ (Problem 12.47):

$$P\{2\log(R_n) \le r\} = 1 - 2\left[\Phi(-z) + \frac{3}{4n} z\phi(z)\right] + O(n^{-2}) , \qquad (12.88)$$

where $z = \sqrt{r}$, $\Phi$ is the standard normal c.d.f. and $\Phi' = \phi$. This implies that
the test that rejects when $2\log(R_n) > c_{1,1-\alpha}$ has rejection probability equal to
$\alpha + O(n^{-1})$. But, a simple correction, known as a Bartlett correction, can improve
the $\chi^2_1$ approximation. Indeed, (12.88) and a Taylor expansion implies

$$P\{2\log(R_n)(1 + \frac{b}{n}) > c_{1,1-\alpha}\} = \alpha + O(n^{-2}) , \qquad (12.89)$$

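if we take $b = -3/2$; the sign is negative because, by (12.88), the uncorrected statistic
is stochastically slightly too large. Thus, the error in rejection probability of the Bartlett-
corrected test is $O(n^{-2})$. Of course, in this example, the exact two-sided $t$-test is
available. ■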
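A numerical sketch of (12.87) and of the multiplicative correction in (12.89), using only the Python standard library. The function names and the data are hypothetical. The default multiplier is $1 - 3/(2n)$: working from (12.88), the uncorrected rejection probability exceeds $\alpha$ by $\frac{3}{2n} z\phi(z)$ to first order, so the factor $1 + b/n$ has to shrink the statistic, which corresponds to $b = -3/2$ in the notation of (12.89).

```python
import math
from statistics import NormalDist, mean, variance


def two_log_lr_normal_mean(xs):
    """2 log(R_n) = n * log(1 + t_n^2 / (n - 1)) for testing mu = 0, as in (12.87)."""
    n = len(xs)
    t_sq = n * mean(xs) ** 2 / variance(xs)  # variance() uses the n - 1 divisor
    return n * math.log(1.0 + t_sq / (n - 1))


def bartlett_corrected(xs, b=-1.5):
    """Multiply 2 log(R_n) by (1 + b/n); b = -3/2 is derived from (12.88)."""
    n = len(xs)
    return two_log_lr_normal_mean(xs) * (1.0 + b / n)


def chi2_one_df_quantile(prob):
    """Quantile of the Chi-squared law with 1 degree of freedom, via Z^2."""
    return NormalDist().inv_cdf(0.5 + prob / 2.0) ** 2


sample = [0.8, -0.3, 1.9, 0.4, 1.1, -0.6, 0.7, 1.5]  # hypothetical data
critical = chi2_one_df_quantile(0.95)
raw = two_log_lr_normal_mean(sample)
corrected = bartlett_corrected(sample)
print(f"raw = {raw:.3f}, corrected = {corrected:.3f}, critical = {critical:.3f}")
```

The corrected statistic is always the smaller of the two, so the corrected test rejects slightly less often, in line with the direction of the error term in (12.88).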

It is worth knowing that, quite generally, a simple multiplicative correction to 
the likelihood ratio statistic greatly improves the quality of the approximation. 
Specifically, for an appropriate choice of $b$, comparing $2\log(R_n)(1 + b/n)$ to the usual
limiting $\chi^2_p$ reduces the error in rejection probability from $O(n^{-1})$ to $O(n^{-2})$. In
practice, b can be derived by analytical means or estimated. The idea for such 
a Bartlett correction originated in Bartlett (1937). For appropriate regularity 
conditions that imply a Bartlett correction works, see Barndorff-Nielsen and Hall 
(1988), Bickel and Ghosh (1990), Jensen (1993) and DiCiccio and Stern (1994). 


12.5 Problems 

Section 12.2 

Problem 12.1 Generalize Example 12.2.1 to the case where $X$ is multivariate
normal with mean vector $\theta$ and nonsingular covariance matrix $\Sigma$.

Problem 12.2 Generalize Example 12.2.2 to the case of a multiparameter 
exponential family. Compare with the result of Problem 12.1. 

Problem 12.3 Suppose $g_n$ is a sequence of functions in $L^2(\mu)$; that is,
$\int g_n^2\, d\mu < \infty$. Assume, for some function $g$, $\int (g_n - g)^2 d\mu \to 0$. Prove that
$\int g^2 d\mu < \infty$.

Problem 12.4 Suppose $g_n$ is a sequence of functions in $L^2(\mu)$ and, for some
function $g$, $\int (g_n - g)^2 d\mu \to 0$. If $\int h^2 d\mu < \infty$, show that $\int h g_n\, d\mu \to \int hg\, d\mu$.

Problem 12.5 Suppose $X$ and $Y$ are independent, with $X$ distributed as $P_\theta$
and $Y$ as $\bar P_\theta$, as $\theta$ varies in a common index set $\Omega$. Assume the families $\{P_\theta\}$ and
$\{\bar P_\theta\}$ are q.m.d. with Fisher Information matrices $I_X(\theta)$ and $I_Y(\theta)$, respectively.
Show that the model based on the joint data $(X, Y)$ is q.m.d. and its Fisher
Information matrix is given by $I_X(\theta) + I_Y(\theta)$.

Problem 12.6 Fix a probability $P$. Let $u(x)$ satisfy

$$\int u(x)\, dP(x) = 0 .$$

(i) Assume $\sup_x |u(x)| < \infty$, so that

$$p_\theta(x) = [1 + \theta u(x)]$$

defines a family of densities (with respect to $P$) for all small $|\theta|$. Show this family
is q.m.d. at $\theta = 0$. Calculate the quadratic mean derivative, score function, and
$I(0)$.

(ii) Alternatively, if $u$ is unbounded, define $p_\theta(x) = C(\theta)\exp(\theta u(x))$, assuming
$\int \exp(\theta u(x))\, dx$ exists for all small $|\theta|$. For this family, argue the family is q.m.d.
at $\theta = 0$, and calculate the score function and $I(0)$.

(iii) Suppose $\int u^2(x)\, dP(x) < \infty$. Define

$$p_\theta(x) = C(\theta)\, 2[1 + \exp(-2\theta u(x))]^{-1} .$$

Show this family is q.m.d. at $\theta = 0$, and calculate the score function and $I(0)$.
[The constructions in this problem are important for nonparametric applications,
used later in Chapters 13 and 14. The last construction is given in van der Vaart
(1998).]

Problem 12.7 Fix a probability $P$ on $S$ and functions $u_i(x)$ such that
$\int u_i(x)\, dP(x) = 0$ and $\int u_i^2(x)\, dP(x) < \infty$, for $i = 1, 2$. Adapt Problem 12.6 to
construct a family of distributions $P_\theta$ with $\theta \in \mathbb{R}^2$, defined for all small $|\theta|$, such
that $P_{0,0} = P$, the family is q.m.d. at $\theta = (0, 0)$ with score vector at $\theta = (0, 0)$
given by $(u_1(x), u_2(x))$. If $S$ is the real line, construct the $P_\theta$ that works even if
$P_\theta$ is required to be smooth if $P$ and the $u_i$ are smooth (i.e. having differentiable
densities) or subject to moment constraints (i.e. having finite $p$th moments).

Problem 12.8 Show that the definition of $I(\theta)$ in Definition 12.2.2 does not
depend on the choice of dominating measure $\mu$.

Problem 12.9 In Examples 12.2.3 and 12.2.4, find the quadratic mean
derivative and $I(\theta)$.

Problem 12.10 In Example 12.2.5, show that $\int \{[f'(x)]^2/f(x)\}\, dx$ is finite iff
$\beta > 1/2$.

Problem 12.11 Prove Theorem 12.2.2 using an argument similar to the proof 
of Theorem 12.2.1. 
Problem 12.12 Suppose $\{P_\theta\}$ is q.m.d. at $\theta_0$ with derivative $\eta(\cdot, \theta_0)$. Show that,
on $\{x : p_{\theta_0}(x) = 0\}$, we must have $\eta(x, \theta_0) = 0$, except possibly on a $\mu$-null set.
Hint: On $\{p_{\theta_0}(x) = 0\}$, write

$$0 \le n^{1/2} p^{1/2}_{\theta_0 + hn^{-1/2}}(x) = \langle h, \eta(x, \theta_0) \rangle + r_{n,h}(x) ,$$

where $\int r_{n,h}^2(x)\, \mu(dx) \to 0$. This implies, with $h$ fixed, that $r_{n,h}(x) \to 0$ except
for $x$ in a $\mu$-null set, at least along some subsequence.

Problem 12.13 Suppose $\{P_\theta\}$ is q.m.d. at $\theta_0$. Show

$$P_{\theta_0 + h}\{x : p_{\theta_0}(x) = 0\} = o(|h|^2)$$

as $|h| \to 0$. Hence, if $X_1, \ldots, X_n$ are i.i.d. with likelihood ratio $L_{n,h}$ defined by
(12.12), show that

$$P^n_{\theta_0 + hn^{-1/2}}\{L_{n,h} = \infty\} \to 0 .$$

Problem 12.14 To see what might happen when the parameter space is not
open, let

$$f_0(x) = xI\{0 \le x \le 1\} + (2 - x)I\{1 < x \le 2\} .$$

Consider the family of densities indexed by $\theta \in [0, 1)$ defined by

$$p_\theta(x) = (1 - \theta^2)f_0(x) + \theta^2 f_0(x - 2) .$$

Show that the condition (12.5) holds when $\theta_0 = 0$, if it is only required that $h$
tends to 0 through positive values. Investigate the behavior of the likelihood ratio
(12.12) for such a family. (For a more general treatment, consult Pollard (1997).)

Problem 12.15 Suppose $X_1, \ldots, X_n$ are i.i.d. and uniformly distributed on
$(0, \theta)$. Let $p_\theta(x) = \theta^{-1} I\{0 < x < \theta\}$ and $L_n(\theta) = \prod_i p_\theta(X_i)$. Fix $p$ and $\theta_0$.
Determine the limiting behavior of $L_n(\theta_0 + hn^{-p})/L_n(\theta_0)$ under $\theta_0$. For what $p$
is the limiting distribution nondegenerate?

Problem 12.16 Suppose $\{P_\theta, \theta \in \Omega\}$ is a model with $\Omega$ an open subset of
$\mathbb{R}^k$, and having densities $p_\theta(x)$ with respect to $\mu$. Define the model to be
$L_1$-differentiable at $\theta_0$ if there exists a vector of real-valued functions $\zeta(\cdot, \theta_0)$ such
that

$$\int |p_{\theta_0 + h}(x) - p_{\theta_0}(x) - \langle \zeta(x, \theta_0), h \rangle|\, d\mu(x) = o(|h|) \qquad (12.90)$$

as $|h| \to 0$. Show that, if the family is q.m.d. at $\theta_0$ with q.m. derivative $\eta(\cdot, \theta_0)$,
then it is $L_1$-differentiable with

$$\zeta(x, \theta_0) = 2\eta(x, \theta_0)p^{1/2}_{\theta_0}(x) ,$$

but the converse is false.

Problem 12.17 Assume $\{P_\theta, \theta \in \Omega\}$ is $L_1$-differentiable, so that (12.90) holds.
For simplicity, assume $k = 1$ (but the problem generalizes). Let $\phi(\cdot)$ be uniformly
bounded and set $\beta(\theta) = E_\theta[\phi(X)]$. Show, $\beta'(\theta)$ exists at $\theta_0$ and

$$\beta'(\theta_0) = \int \phi(x)\zeta(x, \theta_0)\, \mu(dx) . \qquad (12.91)$$

Hence, if $\{P_\theta\}$ is q.m.d. at $\theta_0$ with derivative $\eta(\cdot, \theta_0)$, then,

$$\beta'(\theta_0) = \int \phi(x)\tilde\eta(x, \theta_0)p_{\theta_0}(x)\, \mu(dx) , \qquad (12.92)$$

where $\tilde\eta(x, \theta_0) = 2\eta(x, \theta_0)/p^{1/2}_{\theta_0}(x)$. More generally, if $X_1, \ldots, X_n$ are i.i.d. $P_\theta$
and $\phi(X_1, \ldots, X_n)$ is uniformly bounded, then $\beta(\theta) = E_\theta[\phi(X_1, \ldots, X_n)]$ is
differentiable at $\theta_0$ with

$$\beta'(\theta_0) = \int \cdots \int \phi(x_1, \ldots, x_n) \sum_{i=1}^n \tilde\eta(x_i, \theta_0) \prod_{i=1}^n p_{\theta_0}(x_i)\, \mu(dx_1) \cdots \mu(dx_n) . \qquad (12.93)$$


Section 12.3 

Problem 12.18 Prove (12.31). 

Problem 12.19 Show the convergence (12.35). 

Problem 12.20 Fix two probabilities $P$ and $Q$ and let $P_n = P$ and $Q_n = Q$.
Show that $\{P_n\}$ and $\{Q_n\}$ are contiguous iff $P$ and $Q$ are absolutely continuous.

Problem 12.21 Fix two probabilities $P$ and $Q$ and let $P_n = P^n$ and $Q_n = Q^n$.
Show that $\{P_n\}$ and $\{Q_n\}$ are contiguous iff $P = Q$.

Problem 12.22 Suppose $Q_n$ is contiguous to $P_n$ and let $L_n$ be the likelihood
ratio defined by (12.36). Show that $E_{P_n}(L_n) \to 1$. Is the converse true?

Problem 12.23 Consider a sequence $\{P_n, Q_n\}$ with likelihood ratio $L_n$ defined
in (12.36). Assume

$$\mathcal{L}(L_n | P_n) \xrightarrow{d} W ,$$

where $P\{W = 0\} = 0$. Deduce that $P_n$ is contiguous to $Q_n$. Also, under the
assumptions of Corollary 12.3.1, deduce that $P_n$ and $Q_n$ are mutually contiguous.

Problem 12.24 Suppose, under $P_n$, $X_n = Y_n + o_{P_n}(1)$; that is, $X_n - Y_n \to 0$ in
$P_n$-probability. Suppose $Q_n$ is contiguous to $P_n$. Show that $X_n = Y_n + o_{Q_n}(1)$.

Problem 12.25 Suppose $X_n$ has distribution $P_n$ or $Q_n$ and $T_n = T_n(X_n)$ is
sufficient. Let $P_n^T$ and $Q_n^T$ denote the distribution of $T_n$ under $P_n$ and $Q_n$,
respectively. Prove or disprove: $Q_n$ is contiguous to $P_n$ if and only if $Q_n^T$ is contiguous
to $P_n^T$.

Problem 12.26 Suppose $Q$ is absolutely continuous with respect to $P$. If
$P\{E_n\} \to 0$, then $Q\{E_n\} \to 0$.

Problem 12.27 Prove the convergence (12.40). 

Problem 12.28 Show that $\sigma_{1,2}$ in (12.52) reduces to $h/\sqrt{\pi}$.

Problem 12.29 Verify (12.53) and evaluate it in the case where $f(x) =
\exp(-|x|)/2$ is the double exponential density.

Problem 12.30 Suppose $X_1, \ldots, X_n$ are i.i.d. according to a model which is
q.m.d. at $\theta_0$. For testing $\theta = \theta_0$ versus $\theta = \theta_0 + hn^{-1/2}$, consider the test $\psi_n$ that
rejects $H$ if $\log(L_{n,h})$ exceeds $z_{1-\alpha}\sigma_h - \frac{1}{2}\sigma_h^2$, where $L_{n,h}$ is defined by (12.54)
and $\sigma_h^2 = \langle h, I(\theta_0)h \rangle$. Find the limiting value of $E_{\theta_0 + hn^{-1/2}}(\psi_n)$.

Problem 12.31 Suppose $P_\theta$ is the uniform distribution on $(0, \theta)$. Fix $h$ and
determine whether or not $P_1^n$ and $P^n_{1+h/n}$ are mutually contiguous. Consider
both $h > 0$ and $h < 0$.

Problem 12.32 Assume $X_1, \ldots, X_n$ are i.i.d. according to a family $\{P_\theta\}$ which
is q.m.d. at $\theta_0$. Suppose, for some statistic $T_n = T_n(X_1, \ldots, X_n)$ and some function
$\mu(\theta)$ assumed differentiable at $\theta_0$, $n^{1/2}(T_n - \mu(\theta_n)) \xrightarrow{d} N(0, \sigma^2)$ under $\theta_n$
whenever $\theta_n = \theta_0 + hn^{-1/2}$. Show the same result holds, first whenever $h$ is
replaced by $h_n \to h$, and then whenever $n^{1/2}(\theta_n - \theta_0) = O(1)$.

Problem 12.33 Generalize Corollary 12.3.2 in the following way. Suppose $T_n =
(T_{n,1}, \ldots, T_{n,k}) \in \mathbb{R}^k$. Assume that, under $P_n$,

$$(T_{n,1}, \ldots, T_{n,k}, \log(L_n)) \xrightarrow{d} (T_1, \ldots, T_k, Z) ,$$

where $(T_1, \ldots, T_k, Z)$ is multivariate normal with $Cov(T_i, Z) = c_i$. Then, under
$Q_n$,

$$(T_{n,1}, \ldots, T_{n,k}) \xrightarrow{d} (T_1 + c_1, \ldots, T_k + c_k) .$$

Problem 12.34 Suppose $X_1, \ldots, X_n$ are i.i.d. according to a model $\{P_\theta : \theta \in
\Omega\}$, where $\Omega$ is an open subset of $\mathbb{R}^k$. Assume that the model is q.m.d. Show that
there cannot exist an estimator sequence $T_n$ satisfying

$$\lim_{n \to \infty} \sup_{|\theta - \theta_0| \le n^{-1/2}} P^n_\theta\{n^{1/2}|T_n - \theta| > \epsilon\} = 0 \qquad (12.94)$$

for every $\epsilon > 0$ and any $\theta_0$. (Here $P^n_\theta$ means the joint probability distribution of
$(X_1, \ldots, X_n)$ under $\theta$). Suppose the above condition (12.94) only holds for some
$\epsilon > 0$. Does the same conclusion hold?


Section 12.4

Problem 12.35 In Example 12.4.1, show that the likelihood equations have 
a unique solution which corresponds to a global maximum of the likelihood 
function. 

Problem 12.36 Suppose $X_1, \ldots, X_n$ are i.i.d. $P_\theta$ according to the lognormal
model of Example 12.2.7. Write down the likelihood function and show that it is
unbounded.

Problem 12.37 Generalize Example 12.4.2 to multiparameter exponential 
families. 
Problem 12.38 Prove Corollary 12.4.1. Hint: Simply define $\hat\theta_n = \theta_0 +
n^{-1/2}I^{-1}(\theta_0)Z_n$ and apply Theorem 12.4.1.

Problem 12.39 Let $(X_i, Y_i)$, $i = 1, \ldots, n$ be i.i.d. such that $X_i$ and $Y_i$ are
independent and normally distributed, $X_i$ has variance $\sigma^2$, $Y_i$ has variance $\tau^2$ and
both have common mean $\mu$.

(i) If $\sigma$ and $\tau$ are known, determine an efficient likelihood estimator (ELE) $\hat\mu$ of
$\mu$ and find the limit distribution of $n^{1/2}(\hat\mu - \mu)$.

(ii) If $\sigma$ and $\tau$ are unknown, provide an estimator $\tilde\mu$ for which $n^{1/2}(\tilde\mu - \mu)$ has
the same limit distribution as $n^{1/2}(\hat\mu - \mu)$.

(iii) What can you infer from your results (i) and (ii) regarding the Information
matrix $I(\theta)$, $\theta = (\mu, \sigma, \tau)$?

Problem 12.40 Let $X_1, \ldots, X_n$ be a sample from a Cauchy location model with
density $f(x - \theta)$, where

$$f(z) = \frac{1}{\pi(1 + z^2)} .$$

Compare the limiting distribution of the sample median with that of an efficient
likelihood estimator.

Problem 12.41 Let $X_1, \ldots, X_n$ be i.i.d. $N(\theta, \theta^2)$. Compare the asymptotic
distribution of $\bar X_n^2$ with that of an efficient likelihood estimator sequence.

Problem 12.42 Let $X_1, \ldots, X_n$ be i.i.d. with density

$$f(x, \theta) = [1 + \theta\cos(x)]/2\pi ,$$

where the parameter $\theta$ satisfies $|\theta| < 1$ and $x$ ranges between 0 and $2\pi$. (The
observations $X_i$ may be interpreted as directional data. The case $\theta = 0$ corresponds
to the uniform distribution on the circle.) Construct an efficient likelihood
estimator of $\theta$, as explicitly as possible.

Problem 12.43 Suppose $X_1, \ldots, X_n$ are i.i.d., uniformly distributed on $[0, \theta]$.
Find the maximum likelihood estimator $\hat\theta_n$ of $\theta$. Determine a sequence $\tau_n$ such
that $\tau_n(\hat\theta_n - \theta)$ has a limiting distribution, and determine the limit law.

Problem 12.44 Verify that $\tilde h_n$ in (12.61) maximizes $\tilde L_{n,h}$.

Problem 12.45 For a q.m.d. model with $\hat\theta_n$ satisfying (12.62), find the limiting
behavior of the Wald statistic given in the left side of (12.71) under $\theta_n = \theta_0 +
hn^{-1/2}$.

Problem 12.46 Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with both parameters
unknown. Consider testing $\mu = 0$ versus $\mu \ne 0$. Find the likelihood ratio
test statistic, and determine its limiting distribution under the null hypothesis.
Calculate the limiting power of the test against the sequence of alternatives
$(\mu, \sigma^2) = (h_1 n^{-1/2}, \sigma^2 + h_2 n^{-1/2})$.

Problem 12.47 In Example 12.4.7, verify (12.88) and (12.89). 
Problem 12.48 Suppose $X_1, \ldots, X_n$ are i.i.d. $P_\theta$, with $\theta \in \Omega$, an open subset
of $\mathbb{R}^k$. Assume the family is q.m.d. at $\theta_0$ and consider testing the simple null
hypothesis $\theta = \theta_0$. Suppose $\hat\theta_n$ is an estimator sequence satisfying (12.62), and
consider the Wald test statistic $n(\hat\theta_n - \theta_0)^T I(\theta_0)(\hat\theta_n - \theta_0)$. Find its limiting
distribution against the sequence of alternatives $\theta_0 + hn^{-1/2}$, as well as an expression
for its limiting power against such a sequence of alternatives.

Problem 12.49 Prove (12.76). Then, show that

$$[\Sigma^{(r)}(\theta)]^{-1} \le I^{(r)}(\theta) .$$

What is the statistical interpretation of this inequality?


Problem 12.50 In Example 12.4.5, consider the case of a composite null
hypothesis with $\Omega_0$ given by (12.79). Show that the null distribution of the likelihood
ratio statistic given by (12.80) is $\chi^2_p$. Hint: First consider the case $a = 0$ so that
$\Omega_0$ is a linear subspace of dimension $k - p$. Let $Z = \Sigma^{-1/2}X$, so that

$$2\log(R_n) = \inf_{\theta \in \Omega_0} |Z - \Sigma^{-1/2}\theta|^2 .$$

As $\theta$ varies in $\Omega_0$, $\Sigma^{-1/2}\theta$ varies in a subspace $L$ of dimension $k - p$. If $P$ is
the projection matrix onto $L$ and $I$ is the identity matrix, then $2\log(R_n) =
|(I - P)Z|^2$.

Problem 12.51 In Example 12.4.5, determine the distribution of the likelihood 
ratio statistic against an alternative, both for the simple and composite null 
hypotheses. 

Problem 12.52 Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with both parameters
unknown. Consider testing the simple null hypothesis $(\mu, \sigma^2) = (0, 1)$. Find and
compare the Wald test, Rao's Score test, and the likelihood ratio test.


Problem 12.53 Suppose $X_1, \ldots, X_n$ are i.i.d. with the gamma $\Gamma(g, b)$ density

$$f(x) = \frac{1}{\Gamma(g)b^g}\, x^{g-1}e^{-x/b} , \qquad x > 0 ,$$

with both parameters unknown (and positive). Consider testing the null hypothesis
that $g = 1$, i.e., under the null hypothesis the underlying density is exponential.
Determine the likelihood ratio test statistic and find its limiting distribution.


Problem 12.54 Suppose $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d., with $X_i$ also independent
of $Y_i$. Further suppose $X_i$ is normal with mean $\mu_1$ and variance 1, and $Y_i$ is
normal with mean $\mu_2$ and variance 1. It is known that $\mu_i \ge 0$ for $i = 1, 2$. The
problem is to test the null hypothesis that at most one $\mu_i$ is positive versus the
alternative that both $\mu_1$ and $\mu_2$ are positive.

(i) Determine the likelihood ratio statistic for this problem. 

(ii) In order to carry out the test, how would you choose the critical value
(sequence) so that the size of the test is $\alpha$?
Problem 12.55 (i) In Example 12.4.6, check that the MLE is given by $\hat p_j =
Y_j/n$. (ii) Show (12.82).

Problem 12.56 In Example 12.4.6, show that Rao's Score test is exactly
Pearson's Chi-squared test.

Problem 12.57 In Example 12.4.6, show that $2\log(R_n) - Q_n \xrightarrow{P} 0$ under the
null hypothesis.

Problem 12.58 Prove (12.86). 

Problem 12.59 Provide the details of the proof to part (ii) of Theorem 12.4.2. 

Problem 12.60 Prove (iii) of Theorem 12.4.2. Hint: If $\theta_0$ satisfies the null
hypothesis $g(\theta_0) = 0$, then testing $\Omega_0$ behaves asymptotically like testing the null
hypothesis $D(\theta_0)(\theta - \theta_0) = 0$, which is a hypothesis of the form considered in
part (ii) of the theorem.

Problem 12.61 The problem is to test independence in a contingency table.
Specifically, suppose $X_1, \ldots, X_n$ are i.i.d., where each $X_i$ is cross-classified, so
that $X_i = (r, s)$ with probability $p_{r,s}$, $r = 1, \ldots, R$, $s = 1, \ldots, S$. Under the
full model, the $p_{r,s}$ vary freely, except they are nonnegative and sum to 1. Let
$p_{r\cdot} = \sum_s p_{r,s}$ and $p_{\cdot s} = \sum_r p_{r,s}$. The null hypothesis asserts $p_{r,s} = p_{r\cdot}p_{\cdot s}$ for all
$r$ and $s$. Determine the likelihood ratio test and its limiting null distribution.

Problem 12.62 Consider the following model which therefore generalizes model
(iii) of Section 4.7. A sample of $n_i$ subjects is obtained from class $A_i$ ($i = 1, \ldots, a$),
the samples from different classes being independent. If $Y_{i,j}$ is the number of
subjects from the $i$th sample belonging to $B_j$ ($j = 1, \ldots, b$), the joint distribution
of $(Y_{i,1}, \ldots, Y_{i,b})$ is multinomial, say,

$$M(n_i; p_{1|i}, \ldots, p_{b|i}) .$$

Determine the likelihood ratio statistic for testing the hypothesis of homogeneity
that the vector $(p_{1|i}, \ldots, p_{b|i})$ is independent of $i$, and specify its asymptotic
distribution.

Problem 12.63 The hypothesis of symmetry in a square two-way contingency
table arises when one of the responses $A_1, \ldots, A_a$ is observed for each of $n$ subjects
on two occasions (e.g. before and after some intervention). If $Y_{i,j}$ is the number of
subjects whose responses on the two occasions are $(A_i, A_j)$, the joint distribution
of the $Y_{i,j}$ is multinomial, with the probability of a subject response of $(A_i, A_j)$
denoted by $p_{i,j}$. The hypothesis $H$ of symmetry states that $p_{i,j} = p_{j,i}$ for all $i$ and
$j$; that is, that the intervention has not changed the probabilities. Determine the
likelihood ratio statistic for testing $H$, and specify its asymptotic distribution.
[Bowker (1948).]

Problem 12.64 In the situation of Problem 12.63, consider the hypothesis of
marginal homogeneity $H' : p_{i+} = p_{+i}$ for all $i$, where $p_{i+} = \sum_{j=1}^a p_{i,j}$, $p_{+i} =
\sum_{j=1}^a p_{j,i}$.

(i) The maximum-likelihood estimates of the $p_{i,j}$ under $H'$ are given by $\hat p_{i,j} =
Y_{i,j}/(1 + \lambda_i - \lambda_j)$, where the $\lambda$'s are the solutions of the equations $\sum_j Y_{i,j}/(1 +
\lambda_i - \lambda_j) = \sum_j Y_{j,i}/(1 + \lambda_j - \lambda_i)$. (These equations have no explicit solutions.)

(ii) Determine the number of degrees of freedom for the limiting $\chi^2$-distribution
of the likelihood ratio criterion.

Problem 12.65 Consider the third of the three sampling schemes for a $2 \times 2 \times K$
table discussed in Section 4.8, and the two hypotheses

$$H_1 : \Delta_1 = \cdots = \Delta_K = 1 \quad \text{and} \quad H_2 : \Delta_1 = \cdots = \Delta_K .$$

(i) Obtain the likelihood-ratio test statistic for testing $H_1$.

(ii) Obtain equations that determine the maximum likelihood estimates of the
parameters under $H_2$. (These equations cannot be solved explicitly.)

(iii) Determine the number of degrees of freedom of the limiting $\chi^2$-distribution
of the likelihood ratio test for testing (a) $H_1$, (b) $H_2$.

[For a discussion of these and related hypotheses, see for example Shaffer (1973),
Plackett (1981), or Bishop, Fienberg, and Holland (1975), and the recent study
by Liang and Self (1985).]

Problem 12.66 Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\theta, 1)$. Consider Hodges' superefficient
estimator of $\theta$ (unpublished, but cited in Le Cam (1953)), defined as
follows. Let $\hat\theta_n$ be 0 if $|\bar X_n| \le n^{-1/4}$; otherwise, let $\hat\theta_n = \bar X_n$. For any fixed $\theta$,
determine the limiting distribution of $n^{1/2}(\hat\theta_n - \theta)$. Next, determine the limiting
distribution of $n^{1/2}(\hat\theta_n - \theta_n)$ under $\theta_n = h n^{-1/2}$.

Problem 12.67 Let $(X_{j,1}, X_{j,2})$, $j = 1, \ldots, n$, be independent pairs of independent
exponentially distributed random variables with $E(X_{j,1}) = \theta\lambda_j$ and
$E(X_{j,2}) = \lambda_j$. Here, $\theta$ and the $\lambda_j$ are all unknown. The problem is to test $\theta = 1$
against $\theta > 1$. Compare the Rao, Wald, and likelihood ratio tests for this problem.
Without appealing to any general results, find the limiting distribution of
your statistics, as well as the limiting power against suitable local alternatives.
(Note: the number of parameters is increasing with $n$, so you can't directly appeal
to our previous large sample results.)


12.6 Notes 

According to Le Cam and Yang (2000), the notion of quadratic mean differentiability
was initiated in conversations between Hájek and Le Cam in 1962. Hájek
(1962) appears to be the first publication making use of this notion. The importance
of q.m.d. was prominent in the fundamental works of Le Cam (1969, 1970)
and Hájek (1972), and has been used extensively ever since.

The notion of (mutual) contiguity is due to Le Cam (1960). Its usefulness
was soon recognized by Hájek (1962), who first considered the one-sided version.
Three of Le Cam's fundamental lemmas concerning contiguity became known
as Le Cam's three lemmas, largely due to their prominence in Hájek and Šidák
(1967). Further results can be found in Roussas (1972), Le Cam (1986), Chapter
6, Hájek, Šidák, and Sen (1999), and Le Cam and Yang (2000), Chapter 3.

The methods studied in Section 12.4 are based on the notion of likelihood,
whose general importance was recognized in Fisher (1922, 1925). Rigorous approaches
were developed by Wald (1939, 1943) and Cramér (1946). Cramér
defined the asymptotic efficiency of an asymptotically normal estimator to be
the ratio of its asymptotic variance to the Fisher Information; that such a definition
is flawed even for asymptotically normal estimators was made clear by
Hodges' superefficient estimator (Problem 12.66). Le Cam (1956) introduced the
one-step maximum likelihood estimator, which is based on a discretization trick
coupled with a Newton-Raphson approximation. Such estimators satisfy (12.62)
under weak assumptions and enjoy other optimality properties; for example, see
Section 7.3 of Millar (1983). The notion of a regular estimator sequence introduced
at the end of Section 12.4.1 plays an important role in the theory of efficient estimation
and the Hájek-Inagaki Convolution Theorem; see Hájek (1970), Le Cam
(1979), Beran (1999), Millar (1985), and van der Vaart (1988).

The asymptotic behavior of the likelihood ratio statistic was studied in Wilks
(1938) and Chernoff (1954). Pearson's Chi-squared statistic was introduced in
Pearson (1900) and the Rao score tests by Rao (1947). In fact, the Rao score
test was introduced earlier in the univariate case by Wald (1941b). The asymptotic
equivalence of many of the classical tests is explored in Hall and Mathiason
(1990). Methods based on integrated likelihoods are reviewed in Berger, Liseo
and Wolpert (1999). Caveats about the finite sample behavior of Rao and Wald
tests are given in Le Cam (1990); also see Fears, Benichou and Gail (1996) and
Pawitan (2000). The behavior of likelihood ratio tests under nonstandard conditions
is studied in Vu and Zhou (1997). Extensions of likelihood methods to
semiparametric and nonparametric models are developed in Murphy and van der
Vaart (1997), Owen (1988, 2001) and Fan, Zhang and Zhang (2001). Robust versions
of the Wald, likelihood, and score tests are given in Heritier and Ronchetti
(1994).



13 

Large Sample Optimality 


13.1 Testing Sequences, Metrics, and Inequalities 

In this chapter, some asymptotic optimality theory of hypothesis testing is developed.
We consider testing one sequence of distributions against another (the
asymptotic version of testing a simple hypothesis against a simple alternative).
It turns out that this problem degenerates if the two sequences are too close
together or too far apart. The non-degenerate situation can be characterized in
terms of a suitable distance or metric between the distributions of the two sequences.
Two such metrics, the total variation and the Hellinger metric, will be
introduced below.

We begin by considering some of the basic metrics for probability distributions
that are useful in statistics. Fundamental inequalities relating these metrics are
developed, from which some large sample implications can be derived. We now
recall the definition of a metric space; also see Section A.2 in the appendix.


Definition 13.1.1 A set $V$ is a metric space if there exists a real-valued function
$d$ defined on $V \times V$ such that, for all points $p$, $q$, and $r$ in $V$, $d(p, q) \ge 0$,
$d(p, q) = d(q, p)$ and $d(p, q) \le d(p, r) + d(r, q)$. A function $d$ satisfying these
conditions is called a metric.


In the present context, $V$ will be a collection of probabilities on a (measurable)
space $\mathcal{X}$ (endowed with a $\sigma$-field). We have already encountered two metrics
on the collection of probability distributions on $\mathbb{R}$. One is the Lévy distance
$\rho_L(F, G)$, defined in Definition 11.2.3. The other, used in Example 11.2.12, is the
Kolmogorov-Smirnov distance between distribution functions $F$ and $G$ on the
real line, defined as

$$d_K(F, G) = \sup_t |F(t) - G(t)| . \tag{13.1}$$

It is easy to see that $d_K$ is indeed a metric (Problem 11.21). In the context of
hypothesis testing, two additional distances arise naturally, the total variation
distance and the Hellinger distance.

Before considering the asymptotic problem, consider the problem of testing
a simple hypothesis $P_0$ against a simple alternative $P_1$. Here, $P_i$ is a probability
measure on $(\mathcal{X}, \mathcal{F})$ and $p_i$ will denote the density of $P_i$ with respect to a
dominating measure $\mu$.

In contrast to previous chapters where the hypothesis and alternative were
treated asymmetrically, consider the problem of finding the test $\phi = \phi(X)$ that
minimizes the sum of the error probabilities. For a test $\phi$, denote the sum of the
probability of rejecting $P_0$ when $P_0$ is true and the probability of rejecting $P_1$
when $P_1$ is true by

$$S_{P_0,P_1}(\phi) = \int_{\mathcal{X}} \phi(x)\,dP_0(x) + \int_{\mathcal{X}} [1 - \phi(x)]\,dP_1(x) , \tag{13.2}$$

and let

$$S(P_0, P_1) = \inf_{\phi}\,[S_{P_0,P_1}(\phi)] . \tag{13.3}$$

The following theorem gives the test $\phi^*$ that minimizes $S_{P_0,P_1}(\phi)$ over all possible
tests $\phi$, as well as a simple expression for $S(P_0, P_1)$. Just as in the Neyman-Pearson
setup where the level $\alpha$ is fixed, the optimal test $\phi^*$ is based on comparing
$p_0$ with $p_1$ according to the likelihood ratio $p_1(x)/p_0(x)$, so that the only difference
is the choice of critical value.


Theorem 13.1.1 $S_{P_0,P_1}(\phi)$ is minimized by taking $\phi = \phi^*$ a.e. $\mu$, where $\phi^*$ is
any test satisfying $\phi^*(x) = 1$ if $p_1(x) > p_0(x)$ and $\phi^*(x) = 0$ if $p_1(x) < p_0(x)$.
Furthermore,

$$S(P_0, P_1) = S_{P_0,P_1}(\phi^*) = 1 - \frac{1}{2} \int |p_1(x) - p_0(x)|\,\mu(dx) . \tag{13.4}$$

Proof. For any test $\phi$,

$$S_{P_0,P_1}(\phi) = \int_{\mathcal{X}} \phi(x)[p_0(x) - p_1(x)]\,\mu(dx) + 1 . \tag{13.5}$$

Let $D_- = \{x : p_0(x) - p_1(x) < 0\}$. On $D_-$, the integrand is minimized by
taking $\phi^*(x) = 1$ (since the only constraint on $\phi^*$ is that it take values in $[0, 1]$).
Similarly, on $D_+ = \{x : p_0(x) - p_1(x) > 0\}$, the integrand is minimized by
taking $\phi^*(x) = 0$. On the set $\{x : p_0(x) = p_1(x)\}$, it does not matter how $\phi^*(x)$
is defined. Thus, for any minimizing $\phi^*$,

$$S_{P_0,P_1}(\phi^*) = \int_{D_-} [p_0(x) - p_1(x)]\,\mu(dx) + 1 . \tag{13.6}$$

Reversing the roles of $P_0$ and $P_1$ (with $\phi^*$ now denoting a test that minimizes
$S_{P_1,P_0}$) yields

$$S_{P_1,P_0}(\phi^*) = \int_{D_+} [p_1(x) - p_0(x)]\,\mu(dx) + 1 . \tag{13.7}$$

By symmetry, both expressions are the same, so summing the last two equations
and then dividing by two yields

$$S_{P_0,P_1}(\phi^*) = 1 + \frac{1}{2} \left( \int_{D_-} [p_0(x) - p_1(x)]\,\mu(dx) + \int_{D_+} [p_1(x) - p_0(x)]\,\mu(dx) \right) \tag{13.8}$$

$$= 1 - \frac{1}{2} \int |p_1(x) - p_0(x)|\,\mu(dx) . \quad ■ \tag{13.9}$$

The integral appearing in the last expression leads us to the so-called total
variation distance between $P_0$ and $P_1$.

Definition 13.1.2 The total variation distance between $P_0$ and $P_1$, denoted
$\|P_1 - P_0\|_1$, is given by

$$\|P_1 - P_0\|_1 = \int |p_1 - p_0|\,d\mu , \tag{13.10}$$

where $p_i$ is the density of $P_i$ with respect to any measure $\mu$ dominating both $P_0$
and $P_1$.

It is easy to see that this distance defines a metric (Problem 13.1) and that
this distance is independent of the choice of dominating measure $\mu$. For alternative
characterizations of the total variation distance, see Problem 13.2. Equation
(13.9) can be restated as

$$S_{P_0,P_1}(\phi^*) = 1 - \frac{1}{2} \|P_1 - P_0\|_1 . \tag{13.11}$$
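
As a small numerical illustration of Theorem 13.1.1 and (13.11), the following sketch evaluates the sum of error probabilities on a three-point sample space, with counting measure in the role of $\mu$. The probability vectors `p0` and `p1` and the helper names are illustrative assumptions, not part of the text.

```python
import numpy as np

def total_variation(p0, p1):
    """||P1 - P0||_1 = sum_x |p1(x) - p0(x)| for pmfs on a common finite support."""
    return float(np.abs(np.asarray(p1, float) - np.asarray(p0, float)).sum())

def sum_of_errors(phi, p0, p1):
    """S_{P0,P1}(phi) = E_0[phi] + E_1[1 - phi] for a test phi with values in [0, 1]."""
    phi = np.asarray(phi, float)
    return float((phi * p0).sum() + ((1.0 - phi) * p1).sum())

# Hypothetical pmfs chosen only for illustration.
p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.3, 0.5])

# The minimizing test of Theorem 13.1.1: reject P0 where p1 > p0.
phi_star = (p1 > p0).astype(float)
min_sum = sum_of_errors(phi_star, p0, p1)

# (13.11): the minimum equals 1 - (1/2) * total variation distance.
via_tv = 1.0 - 0.5 * total_variation(p0, p1)
```

Here `min_sum` and `via_tv` agree, and any other test, randomized or not, gives a sum of error probabilities at least as large.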

If $X_1, \ldots, X_n$ are i.i.d. $P$, let $P^n$ denote their joint distribution. We will next
consider a sequence of tests $\phi_n$ for testing $P_0^n$ against $P_1^n$. The minimum sum of
error probabilities is then $S(P_0^n, P_1^n)$. The test (sequence) that minimizes the sum
of error probabilities is connected with the more usual test in which probability
of false rejection of $P_0^n$ is fixed at $\alpha$ by the following lemma. The proof is left as
an exercise (Problem 13.5).

Lemma 13.1.1 (i) If there exists a sequence of tests $\phi_n$ for which the sum of
error probabilities tends to 0, then given any fixed $\alpha$ ($0 < \alpha < 1$) and $n$ sufficiently
large, the level of $\phi_n$ will be less than $\alpha$, and its power will tend to 1 as $n \to \infty$.
(ii) If for every sequence $\{\phi_n\}$, the sum of the error probabilities tends to 1,
then for any sequence whose rejection probability under $P_0^n$ tends to $\alpha$, the limiting
power is $\alpha$, and hence is no better than that of a test that rejects $P_0^n$ with
probability $\alpha$ independent of the data.

We would like to determine conditions for which the limiting sum of error
probabilities is zero or one, as well as for the more important intermediate situation.
In order to determine the limiting behavior of $S(P_0^n, P_1^n)$, we need to
study the behavior of $\|P_1^n - P_0^n\|_1$. Unfortunately, this quantity is often difficult
to compute, but it is related to another distance which is easier to manage. This
is the following Hellinger distance.

Definition 13.1.3 Let $P_0$ and $P_1$ be probabilities on $(\mathcal{X}, \mathcal{F})$. The Hellinger
distance $H(P_0, P_1)$ between $P_0$ and $P_1$ is given by

$$H^2(P_0, P_1) = \frac{1}{2} \int_{\mathcal{X}} \left[ \sqrt{p_1(x)} - \sqrt{p_0(x)} \right]^2 d\mu(x) , \tag{13.12}$$

where $p_i$ is the density of $P_i$ with respect to any measure $\mu$ dominating $P_0$ and
$P_1$.

The value of $H(P_0, P_1)$ is independent of the choice of $\mu$ (Problem 13.1) and
one can, for example, always use $\mu = P_0 + P_1$. It is also easy to see that this
distance defines a metric.¹ By squaring the integrand and using the fact that the
densities $p_i$ must integrate to one, it follows that

$$H^2(P_0, P_1) = 1 - \rho(P_0, P_1) , \tag{13.13}$$

where $\rho(P_0, P_1)$ is known as the affinity between $P_0$ and $P_1$ and is given by

$$\rho(P_0, P_1) = \int_{\mathcal{X}} \sqrt{p_0(x) p_1(x)}\,d\mu(x) . \tag{13.14}$$

Note that, by Cauchy-Schwarz, $0 \le \rho(P_0, P_1) \le 1$ and $\rho(P_0, P_1) = 1$ if and only
if $P_0 = P_1$. Furthermore, $\rho(P_0, P_1) = 0$ if and only if $P_0$ and $P_1$ are mutually
singular, i.e., there exists a (measurable) set $E$ with $P_0(E) = 1$ and $P_1(E) = 0$.
It follows, for example, that $H(P_0, P_1) = 0$ if and only if $P_0 = P_1$.

From equation (13.14), it immediately follows that

$$\rho(P_0^n, P_1^n) = \rho^n(P_0, P_1) \tag{13.15}$$

and hence

$$H^2(P_0^n, P_1^n) = 1 - \rho^n(P_0, P_1) = 1 - [1 - H^2(P_0, P_1)]^n . \tag{13.16}$$

Therefore, the behavior of $H^2(P_0^n, P_1^n)$ with increasing $n$ can be obtained from
$n$ and $H(P_0, P_1)$ in a simple way.
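
The product formulas (13.15) and (13.16) can be checked by brute force on a small finite sample space. The sketch below enumerates all outcomes of $n$ i.i.d. draws and compares the directly computed Hellinger distance with the closed form; the pmfs, the value of `n`, and the function names are illustrative assumptions.

```python
import numpy as np
from itertools import product

def affinity(p0, p1):
    """rho(P0, P1) = sum_x sqrt(p0(x) p1(x)) for pmfs on a common finite support."""
    return float(np.sqrt(np.asarray(p0, float) * np.asarray(p1, float)).sum())

def hellinger_sq(p0, p1):
    """H^2(P0, P1) = 1 - rho(P0, P1), as in (13.13)."""
    return 1.0 - affinity(p0, p1)

def product_pmf(p, n):
    """pmf of n i.i.d. draws from p, enumerated over all len(p)**n outcomes."""
    return np.array([np.prod(cell) for cell in product(p, repeat=n)])

# Hypothetical pmfs and sample size chosen only for illustration.
p0 = [0.5, 0.3, 0.2]
p1 = [0.2, 0.3, 0.5]
n = 4

direct = hellinger_sq(product_pmf(p0, n), product_pmf(p1, n))  # brute force
closed_form = 1.0 - (1.0 - hellinger_sq(p0, p1)) ** n          # (13.16)
```

The enumeration grows like `len(p)**n`, so this check is only practical for very small `n`; the closed form has no such limit.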

Next, we will relate $H(P_0, P_1)$ to $\|P_0 - P_1\|_1$, which was already seen to have
a clear statistical interpretation.

Theorem 13.1.2 The following relationships hold between Hellinger distance
and total variation distance:

$$H^2(P_0, P_1) \le \frac{1}{2} \|P_0 - P_1\|_1 \le H(P_0, P_1)[2 - H^2(P_0, P_1)]^{1/2} = [1 - \rho^2(P_0, P_1)]^{1/2} . \tag{13.17}$$

Proof. To prove the first inequality, note that

$$H^2(P_0, P_1) = \frac{1}{2} \int (\sqrt{p_1} - \sqrt{p_0})^2\,d\mu \le \frac{1}{2} \int |\sqrt{p_1} - \sqrt{p_0}| \cdot |\sqrt{p_1} + \sqrt{p_0}|\,d\mu = \frac{1}{2} \int |p_1 - p_0|\,d\mu = \frac{1}{2} \|P_0 - P_1\|_1 .$$

To prove the second inequality, apply the Cauchy-Schwarz inequality to get

$$\frac{1}{2} \|P_0 - P_1\|_1 = \frac{1}{2} \int |\sqrt{p_1} - \sqrt{p_0}| \cdot |\sqrt{p_1} + \sqrt{p_0}|\,d\mu$$

$$\le \left[ \frac{1}{2} \int (\sqrt{p_1} - \sqrt{p_0})^2\,d\mu \right]^{1/2} \left[ \frac{1}{2} \int (\sqrt{p_1} + \sqrt{p_0})^2\,d\mu \right]^{1/2}$$

$$= H(P_0, P_1) \left[ \frac{1}{2} \int (\sqrt{p_1} + \sqrt{p_0})^2\,d\mu \right]^{1/2}$$

$$= H(P_0, P_1)[1 + \rho(P_0, P_1)]^{1/2} = H(P_0, P_1)[2 - H^2(P_0, P_1)]^{1/2} ,$$

with the last equality following from the definition $H^2(P_0, P_1) = 1 - \rho(P_0, P_1)$;
the last equality in the statement of the theorem follows immediately from this
definition as well. ■

¹ Some authors prefer to leave out the constant 1/2 in their definition. Using Definition
13.1.3, the square of the Hellinger distance between $P_0$ and $P_1$ is just one-half the square
of the $L^2(\mu)$-distance between $\sqrt{p_0}$ and $\sqrt{p_1}$. Using the Hellinger distance makes it
unnecessary to choose a particular $\mu$, and the Hellinger distance is even defined for all
pairs of probabilities on a space where no single dominating measure exists.
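
The two inequalities of Theorem 13.1.2 are easy to probe numerically. The following sketch returns the three quantities in (13.17) for a pair of pmfs and evaluates them on randomly generated pairs; the random pmfs, the seed, and the function name are assumptions made only for this illustration.

```python
import numpy as np

def hellinger_tv_chain(p0, p1):
    """Return (H^2, half the total variation, [1 - rho^2]^{1/2}) for two pmfs."""
    p0 = np.asarray(p0, float)
    p1 = np.asarray(p1, float)
    rho = float(np.sqrt(p0 * p1).sum())          # affinity (13.14)
    h_sq = 1.0 - rho                             # (13.13)
    half_tv = 0.5 * float(np.abs(p1 - p0).sum())
    upper = float(np.sqrt(max(1.0 - rho ** 2, 0.0)))  # guard against rounding
    return h_sq, half_tv, upper

# Randomly generated pmfs on five points (an arbitrary illustrative choice).
rng = np.random.default_rng(0)
chains = [hellinger_tv_chain(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5)))
          for _ in range(1000)]
all_ordered = all(h <= tv + 1e-12 and tv <= up + 1e-12 for h, tv, up in chains)
```

A numerical check of this kind is no substitute for the proof, but it is a cheap way to catch sign or normalization mistakes (for instance, omitting the factor 1/2 in Definition 13.1.3).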

Consider now the problem of deciding between $P_0^n$ and $P_1^n$ based on $n$ i.i.d.
observations from $P_0$ or $P_1$. Theorems 13.1.1 and 13.1.2 immediately yield the
following result.


Corollary 13.1.1 Fix any $P_0$ and $P_1$ with $P_0 \ne P_1$. Then, $S(P_0^n, P_1^n)$ tends to
0 exponentially fast; more specifically,

$$S(P_0^n, P_1^n) \le \rho^n(P_0, P_1) \to 0 \quad \text{as } n \to \infty . \tag{13.18}$$

Proof. By Theorem 13.1.2 and equation (13.16),

$$\frac{1}{2} \|P_0^n - P_1^n\|_1 \ge H^2(P_0^n, P_1^n) = 1 - \rho^n(P_0, P_1) . \tag{13.19}$$

Hence, by Theorem 13.1.1 and (13.19),

$$S(P_0^n, P_1^n) = 1 - \frac{1}{2} \|P_0^n - P_1^n\|_1 \le \rho^n(P_0, P_1) \to 0 \tag{13.20}$$

as $n \to \infty$, since $\rho(P_0, P_1) < 1$ as $P_0 \ne P_1$. ■

Thus, we can conclude there always exists a perfectly discriminating sequence
of tests for testing $P_0$ against $P_1$ based on $n$ i.i.d. observations in the sense that
the sum of the error probabilities tends to 0.

Since, for any fixed $n$, the probabilities of error in testing $P_0^n$ against $P_1^n$ are not
zero (unless $P_0$ and $P_1$ are singular), such asymptotic convergence is of limited
value. To obtain a more discriminating result, we will consider the problem of
testing $P_{\theta_0}^n$ against $P_{\theta_n}^n$ based on $n$ i.i.d. observations, where $P_{\theta_n}$ is a sequence of
probability distributions getting closer to $P_{\theta_0}$. Closeness here will conveniently
be expressed by the Hellinger metric. We would like to consider $P_{\theta_n}$ close enough
to $P_{\theta_0}$ as $n \to \infty$ so that the testing problem becomes difficult for the statistician
in the sense that there does not exist a test sequence whose error probabilities
both tend to zero. On the other hand, we would also not want $P_{\theta_n}$ and $P_{\theta_0}$ to be
so close that no sequence of tests will have any reasonable amount of power. The
following theorem characterizes this situation and shows that the intermediate
situation occurs if and only if $nH^2(P_{\theta_0}, P_{\theta_n}) \asymp 1$.

Theorem 13.1.3 Suppose

$$c_1 = \liminf nH^2(P_{\theta_0}, P_{\theta_n}) \le \limsup nH^2(P_{\theta_0}, P_{\theta_n}) = c_2 . \tag{13.21}$$

Then,

$$1 - [1 - \exp(-2c_2)]^{1/2} \le \liminf S(P_{\theta_0}^n, P_{\theta_n}^n) \le \limsup S(P_{\theta_0}^n, P_{\theta_n}^n) \le \exp(-c_1) . \tag{13.22}$$

Proof. To prove $\limsup S(P_{\theta_0}^n, P_{\theta_n}^n) \le \exp(-c_1)$, assume first that

$$nH^2(P_{\theta_0}, P_{\theta_n}) \to c \ge c_1 .$$

By Corollary 13.1.1,

$$S(P_{\theta_0}^n, P_{\theta_n}^n) \le \rho^n(P_{\theta_0}, P_{\theta_n}) = [1 - H^2(P_{\theta_0}, P_{\theta_n})]^n \to \exp(-c) \le \exp(-c_1) .$$

By applying this argument to subsequences $\theta_{n_j}$ such that $n_j H^2(P_{\theta_0}, P_{\theta_{n_j}})$
converges, the last inequality in (13.22) follows. Similarly, suppose

$$nH^2(P_{\theta_0}, P_{\theta_n}) \to c \le c_2 .$$

The first inequality follows if we show that

$$1 - [1 - \exp(-2c)]^{1/2} \le \liminf_n S(P_{\theta_0}^n, P_{\theta_n}^n) .$$

By Theorem 13.1.1 and then Theorem 13.1.2,

$$S(P_{\theta_0}^n, P_{\theta_n}^n) = 1 - \frac{1}{2} \|P_{\theta_n}^n - P_{\theta_0}^n\|_1 \ge 1 - [1 - \rho^2(P_{\theta_0}^n, P_{\theta_n}^n)]^{1/2} .$$

By (13.15), this becomes

$$1 - [1 - \rho^{2n}(P_{\theta_0}, P_{\theta_n})]^{1/2} = 1 - \{1 - [1 - H^2(P_{\theta_0}, P_{\theta_n})]^{2n}\}^{1/2} \to 1 - [1 - \exp(-2c)]^{1/2} ,$$

and the result follows. ■
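
The finite-$n$ inequalities behind Corollary 13.1.1 and Theorem 13.1.3, namely $1 - [1 - \rho^{2n}]^{1/2} \le S(P_{\theta_0}^n, P_{\theta_n}^n) \le \rho^n$, can be seen at work in a concrete model. The sketch below uses Bernoulli observations with success probability $1/2$ under the hypothesis and $1/2 + h n^{-1/2}$ under the alternative; this model, the value of `h`, and all function names are assumptions chosen for illustration. Because the likelihood ratio depends on the data only through the number of successes, the total variation distance between the two product measures equals that between the two binomial distributions, which is what `min_error_sum` computes.

```python
import math

def binom_pmf(n, p):
    """Binomial(n, p) probabilities for k = 0, ..., n."""
    return [math.comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(n + 1)]

def min_error_sum(n, p0, p1):
    """Exact S(P0^n, P1^n) for Bernoulli(p0) versus Bernoulli(p1), via (13.11)."""
    b0, b1 = binom_pmf(n, p0), binom_pmf(n, p1)
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(b0, b1))

def hellinger_sq(p0, p1):
    """H^2 between Bernoulli(p0) and Bernoulli(p1), from (13.13) and (13.14)."""
    rho = math.sqrt(p0 * p1) + math.sqrt((1.0 - p0) * (1.0 - p1))
    return 1.0 - rho

def finite_n_bounds(n, p0, p1):
    """Lower and upper bounds on S(P0^n, P1^n) from Theorems 13.1.1 and 13.1.2."""
    rho_n = (1.0 - hellinger_sq(p0, p1)) ** n   # rho^n, by (13.15)
    return 1.0 - math.sqrt(1.0 - rho_n ** 2), rho_n

h, p0 = 1.0, 0.5   # illustrative local-alternative constant and null value
rows = []
for n in (25, 100, 400):
    p1 = p0 + h / math.sqrt(n)
    lower, upper = finite_n_bounds(n, p0, p1)
    rows.append((n, n * hellinger_sq(p0, p1), lower, min_error_sum(n, p0, p1), upper))
```

In this example $nH^2$ settles near $h^2 I(1/2)/8 = 0.5$, the value suggested by (13.23) with Fisher information $I(1/2) = 4$, so the exact minimum sum of errors stays strictly between 0 and 1: the nondegenerate case.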


Thus, from an asymptotic point of view, it is reasonable to consider alternatives
$\theta_n$ to $\theta_0$ such that $nH^2(P_{\theta_0}, P_{\theta_n})$ is bounded away from 0 and $\infty$. Otherwise, the
problem is asymptotically degenerate in the sense that either there exists a test
sequence $\phi_n$ for testing $\theta_0$ versus $\theta_n$ such that the probability of a type 1 error
tends to zero and the power at $\theta_n$ tends to one, or no sequence of level $\alpha$ tests
will have asymptotic power greater than $\alpha$. We next consider what the condition
on $nH^2(P_{\theta_0}, P_{\theta_n})$ becomes in some classical examples.


Example 13.1.1 (Quadratic Mean Differentiable Families) Assume that
$\{P_\theta, \theta \in \Omega\}$ is q.m.d. with derivative $\eta(\cdot, \theta_0)$ at $\theta_0$ and positive definite $I(\theta_0)$.
Suppose $n^{1/2}(\theta_n - \theta_0) \to h$. By equation (12.6) and Lemma 12.2.2,

$$2nH^2(P_{\theta_0}, P_{\theta_n}) = n \int \left[ \sqrt{p_{\theta_n}} - \sqrt{p_{\theta_0}} \right]^2 d\mu \to \int |\langle \eta(x, \theta_0), h \rangle|^2\,d\mu(x) = \frac{1}{4} \langle h, I(\theta_0) h \rangle < \infty . \tag{13.23}$$

Thus, the nondegenerate situation occurs when $|\theta_n - \theta_0| = O(n^{-1/2})$. Note that
the limiting value (13.23) is never 0 unless $h = 0$ (Problem 13.8). ■

Example 13.1.2 (Uniform Family; Example 12.2.8, continued) Let $P_\theta$ be
the uniform distribution on $(0, \theta)$. Then, $nH^2(P_{\theta_0}, P_{\theta_n})$ tends to a finite, positive
limit if and only if $n(\theta_n - \theta_0) \to h < \infty$ (Problem 13.4). Hence, alternatives
$\theta_n$ such that $\theta_n - \theta_0 \asymp n^{-1}$ cannot be perfectly discriminated, yet tests can be
constructed that have reasonable power against these alternatives. ■

To clarify the difference between the previous two examples, note that in
Example 13.1.1 we have

$$H^2(P_{\theta_0}, P_{\theta_n}) \asymp (\theta_n - \theta_0)^2 ,$$

while in Example 13.1.2 we have

$$H^2(P_{\theta_0}, P_{\theta_n}) \asymp |\theta_n - \theta_0| .$$
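
The two rates can be compared using closed-form Hellinger distances. The sketch below assumes two concrete families that are not spelled out in the text: the normal location family $N(\theta, 1)$ as a q.m.d. instance of Example 13.1.1, for which the affinity between $N(0,1)$ and $N(\delta,1)$ is $\exp(-\delta^2/8)$, and the uniform family of Example 13.1.2, for which the affinity between $U(0, \theta_0)$ and $U(0, \theta_1)$ is $(\min(\theta_0,\theta_1)/\max(\theta_0,\theta_1))^{1/2}$. Function names and the values of `n` and `h` are illustrative.

```python
import math

def h2_normal_shift(delta):
    """H^2 between N(0, 1) and N(delta, 1); equals 1 - exp(-delta^2 / 8)."""
    return -math.expm1(-delta ** 2 / 8.0)

def h2_uniform(theta0, theta1):
    """H^2 between U(0, theta0) and U(0, theta1); equals 1 - sqrt(min / max)."""
    lo, hi = sorted((theta0, theta1))
    return 1.0 - math.sqrt(lo / hi)

n, h = 10 ** 6, 2.0

# q.m.d. case: alternatives at distance h / sqrt(n) keep n * H^2 bounded.
n_h2_normal = n * h2_normal_shift(h / math.sqrt(n))    # close to h^2 / 8

# Uniform case (theta0 = 1): alternatives at distance h / n keep n * H^2 bounded.
n_h2_uniform = n * h2_uniform(1.0, 1.0 + h / n)        # close to h / 2

# Using the slower 1 / sqrt(n) rate in the uniform model makes n * H^2 blow up.
n_h2_uniform_slow = n * h2_uniform(1.0, 1.0 + h / math.sqrt(n))
```

The last quantity is of order $n^{1/2}$, which corresponds to the degenerate case in which the two sequences can be perfectly discriminated asymptotically.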

Example 13.1.3 (Example 12.2.5, continued) Consider densities

$$p_\theta(x) = C(\beta) \exp\{-|x - \theta|^\beta\}$$

and set $\theta_0 = 0$. In this example, the following can be shown (see Le Cam and
Yang (1990), Lemma 5 in Section 7.3). If $\beta > 1/2$, the family is q.m.d. and
so $H^2(P_0, P_\delta)/\delta^2$ tends to a finite limit as $\delta \to 0$; thus, the right rate to keep
the problem nondegenerate is $\delta \asymp n^{-1/2}$. If $\beta = 1/2$, $H^2(P_0, P_\delta)/[\delta^2 |\log(\delta)|]$
tends to a finite limit as $\delta \to 0$, and so the corresponding nondegenerate rate is
$\delta \asymp (n \log n)^{-1/2}$. If $0 < \beta < 1/2$, $H^2(P_0, P_\delta)/\delta^{1+2\beta}$ tends to a finite limit, in
which case the corresponding nondegenerate rate is $\delta \asymp n^{-1/(1+2\beta)}$. ■

Even though the above asymptotic development studies the limiting behavior
of tests based on the criterion of minimum sum of error probabilities, it is also relevant
to the usual Neyman-Pearson formulation when we consider tests whose
level is $\alpha$ for some fixed $\alpha > 0$. For, if $nH^2(P_{\theta_0}, P_{\theta_n}) \to \infty$, then $S(P_{\theta_0}^n, P_{\theta_n}^n) \to 0$,
by Theorem 13.1.3. Thus, by Lemma 13.1.1, given $\epsilon > 0$, for large enough $n$ there
exists a test sequence $\phi_n$ whose level is less than $\epsilon$ and whose power against $\theta_n$ is
at least $1 - \epsilon$. So clearly, there exist level $\alpha$ test sequences whose power against
$\theta_n$ tends to one.

On the other hand, if $nH^2(P_{\theta_0}, P_{\theta_n}) \to 0$, then no sequence of level $\alpha$ tests
has limiting power against $\theta_n$ greater than $\alpha$ (Problem 13.6).

As before, the interesting nondegenerate asymptotic situation occurs when
$nH^2(P_{\theta_0}, P_{\theta_n}) \to c$ for some finite positive $c$. In this case, there exists a level
$\alpha$ test sequence whose limiting power against $\theta_n$ exceeds $\alpha$. Typically, the value
of the limiting power is strictly less than one, but in some cases it may equal
one (which does not contradict Theorem 13.1.3 because the sum of the errors is
tending to $\alpha > 0$); see Problem 13.9.

The following theorem clarifies the relationship between $P_n^n$ and $Q_n^n$ being
contiguous and the Hellinger metric between $P_n$ and $Q_n$.


Theorem 13.1.4 (i) If $nH^2(P_n, Q_n) \to 0$, then $\|Q_n^n - P_n^n\|_1 \to 0$ and $\{P_n^n\}$ and
$\{Q_n^n\}$ are contiguous.

(ii) If $nH^2(P_n, Q_n) \to \infty$, then $S(P_n^n, Q_n^n) \to 0$ and $\{P_n^n\}$ and $\{Q_n^n\}$ are not
contiguous.

Proof. To prove (i), note that Theorem 13.1.3 holds if $P_{\theta_0}$ is allowed to vary
with $n$, with no change in the argument or the conclusion. Thus, by (13.22) with
$c_1 = c_2 = 0$, $nH^2(P_n, Q_n) \to 0$ implies $S(P_n^n, Q_n^n) \to 1$. Therefore, by Problem 13.10,
$\|P_n^n - Q_n^n\|_1 \to 0$. To prove (ii), assume $nH^2(P_n, Q_n) \to \infty$. By Theorem 13.1.3,
$S(P_n^n, Q_n^n) \to 0$. Hence, there exists a test sequence $\phi_n$ such that $E_{P_n^n}(\phi_n) \to 0$
and $E_{Q_n^n}(\phi_n) \to 1$. Let $L_n$ denote the likelihood ratio of $Q_n^n$ with respect to $P_n^n$.
But, Theorem 13.1.1 shows that $\phi_n$ can be taken to be the indicator of the set
$A_n = \{L_n > 1\}$. Then, $P_n^n(A_n) \to 0$ but $Q_n^n(A_n) \to 1$. ■


Example 13.1.4 (Example 13.1.1, continued) Assume $\{P_\theta, \theta \in \Omega\}$ is q.m.d.
at $\theta_0$, and $h_n \to h$. Then, by a calculation similar to that in Example 13.1.1,
$nH^2(P_{\theta_0 + h n^{-1/2}}, P_{\theta_0 + h_n n^{-1/2}}) \to 0$ (Problem 13.11). Therefore, by Theorem
13.1.4(i), $P_{\theta_0 + h_n n^{-1/2}}^n$ is contiguous to $P_{\theta_0}^n$. This result forms the basis for generalizing
results such as Theorem 12.2.3, Theorem 12.4.1 and Corollary 12.4.1,
which have been shown to be true when $h_n = h$, to the more general case when
$h_n \to h$; see Problems 13.12 and 13.13. ■

In the intermediate situation $nH^2(P_n, Q_n) \asymp 1$, $P_n^n$ and $Q_n^n$ may or may not be
contiguous. Example 13.1.1 provides an example where contiguity holds. However,
reconsider Example 13.1.2, where $P_n$ is uniform on $[0, 1]$ and $Q_n$ is uniform on
$[0, 1 + hn^{-1}]$, where $h > 0$. Then, $nH^2(P_n, Q_n) \asymp 1$, but $Q_n^n$ is not contiguous
with respect to $P_n^n$. To see why, let $A_n$ be the event that the maximum of $n$
i.i.d. observations exceeds 1. Then, $P_n^n(A_n) = 0$, while $Q_n^n(A_n) \to 1 - e^{-h}$. For
a sharp result on the relationship between contiguity and Hellinger distance, see
Oosterhoff and van Zwet (1979).
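
The failure of contiguity in the uniform example comes down to one elementary computation: under $Q_n^n$, all $n$ observations fall in $[0, 1]$ with probability $(1 + h/n)^{-n}$, so $Q_n^n(A_n) = 1 - (1 + h/n)^{-n}$. The following sketch evaluates this probability; the function name and the chosen values of `n` and `h` are illustrative.

```python
import math

def prob_max_exceeds_one(n, h):
    """Q_n^n(A_n) when Q_n is uniform on [0, 1 + h/n] and A_n = {max X_i > 1}."""
    return 1.0 - (1.0 + h / n) ** (-n)

h = 1.0
values = [(n, prob_max_exceeds_one(n, h)) for n in (10, 1000, 10 ** 6)]
limit = 1.0 - math.exp(-h)   # the limiting value 1 - e^{-h} stated in the text
```

Since $P_n^n(A_n) = 0$ for every $n$ while $Q_n^n(A_n)$ stays bounded away from 0, the events $A_n$ witness that $Q_n^n$ is not contiguous with respect to $P_n^n$.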


13.2 Asymptotic Relative Efficiency 

Consider the problem of testing $H : \theta \in \Omega_0$ against $\theta \notin \Omega_0$ when $X_1, \ldots, X_n$ are
i.i.d. according to a model $\{P_\theta, \theta \in \Omega\}$. Our main goal is to derive tests that are
asymptotically optimal. However, other considerations (such as robustness) may
suggest using non-optimal tests. It is then important to know how much is lost
by the use of such sub-optimal tests. In this section, we shall therefore compare
the performance of two test procedures $\phi_n$ and $\tilde\phi_n$. In this context, performance
is measured in terms of power. Roughly speaking, the relative efficiency of $\phi_n$
with respect to $\tilde\phi_n$ is defined to be $\tilde n / n$, where $n$ and $\tilde n$ are the sample sizes
required for $\phi_n$ and $\tilde\phi_{\tilde n}$ to have the same power at the same level against the
same alternative. For instance, a ratio of 2 would indicate that $\phi_n$ is twice as
efficient as $\tilde\phi_n$ because twice as many observations are required for $\tilde\phi_n$ to have
the same power at a given alternative as $\phi_n$. Such a comparison can be based on
the following result.

Theorem 13.2.1 Suppose $X_1, \ldots, X_n$ are i.i.d. according to a q.m.d. family indexed
by a real parameter $\theta$, and consider testing $\theta = \theta_0$ versus $\theta > \theta_0$. Assume
the sequence $\phi = \{\phi_n\}$ is based on test statistics $T_n$ satisfying the following: there
exists a function $\mu(\cdot)$ and a number $\sigma^2 > 0$ such that, under any sequence $\theta_n$
satisfying $n^{1/2}(\theta_n - \theta_0) = O(1)$,

$$n^{1/2}[T_n - \mu(\theta_n)] \xrightarrow{d} N(0, \sigma^2) ; \tag{13.24}$$

moreover, $\mu(\cdot)$ is assumed to have a right-hand derivative $\mu'(\theta_0) > 0$ at $\theta_0$.
Suppose $\phi_n$ rejects when $n^{1/2}[T_n - \mu(\theta_0)] > c_n$, where

$$c_n \to z_{1-\alpha}\sigma \tag{13.25}$$

in probability under $\theta_0$. Then, the following is true.

(i) $E_{\theta_0}(\phi_n) \to \alpha$ as $n \to \infty$.

(ii) The limiting power of $\phi_n$ against $\theta_n$ satisfying $n^{1/2}(\theta_n - \theta_0) \to h$ is

$$\lim_n E_{\theta_n}(\phi_n) = 1 - \Phi\left( z_{1-\alpha} - h \frac{\mu'(\theta_0)}{\sigma} \right) . \tag{13.26}$$

(iii) Fix $0 < \alpha < \beta < 1$. Let $\theta_k$ be any sequence satisfying $\theta_k > \theta_0$ and $\theta_k \to \theta_0$
as $k \to \infty$ and let $n_k$ be any sequence for which $E_{\theta_k}(\phi_{n_k}) \ge \beta$. Then,²

$$n_k \sim \frac{(z_{1-\alpha} - z_{1-\beta})^2 \sigma^2}{[(\theta_k - \theta_0)\mu'(\theta_0)]^2} . \tag{13.27}$$

Proof. Part (i) follows by Slutsky's Theorem. To prove (ii), let $\theta_n$ satisfy
$n^{1/2}(\theta_n - \theta_0) \to h$. By contiguity (Example 13.1.4), it follows that $c_n \to z_{1-\alpha}\sigma$ in
probability under $\theta_n$. Also,

$$n^{1/2}[\mu(\theta_n) - \mu(\theta_0)] \to h\mu'(\theta_0) .$$

Letting $Z$ denote a standard normal variable, by Slutsky's Theorem,

$$E_{\theta_n}(\phi_n) = P_{\theta_n}\{ n^{1/2}[T_n - \mu(\theta_n)] > c_n - n^{1/2}[\mu(\theta_n) - \mu(\theta_0)] \}$$

$$\to P\{ \sigma Z > z_{1-\alpha}\sigma - h\mu'(\theta_0) \} ,$$

implying (ii).

To prove (iii), choose $h = h_\beta$ so that the right side of (13.26) is $\beta$, and hence

$$h_\beta = (z_{1-\alpha} - z_{1-\beta}) \cdot \frac{\sigma}{\mu'(\theta_0)} .$$

By (ii), if $\theta_n$ satisfies $n^{1/2}(\theta_n - \theta_0) \to h_\beta$, then the limiting power of $\phi_n$ against
$\theta_n$ is $\beta$. It follows that the limiting power of $\phi_n$ against $\theta_n$ is $\beta$ if and only if $\theta_n$
satisfies

$$n \sim \frac{(z_{1-\alpha} - z_{1-\beta})^2 \sigma^2}{[(\theta_n - \theta_0)\mu'(\theta_0)]^2} .$$

For an arbitrary sequence $\theta_k \to \theta_0$, let $m_k$ satisfy $m_k^{1/2}(\theta_k - \theta_0) \to h_\beta$. Then,
since $m_k^{1/2}(\theta_k - \theta_0) = O(1)$, the asymptotic normality assumption for $T_{m_k}$ holds,
and the above argument shows the limiting power of $\phi_{m_k}$ against $\theta_k$ is $\beta$ iff

$$m_k \sim \frac{(z_{1-\alpha} - z_{1-\beta})^2 \sigma^2}{[(\theta_k - \theta_0)\mu'(\theta_0)]^2} .$$


² The notation $a_k \sim b_k$ means $a_k / b_k \to 1$.

To show $n_k \sim m_k$, we first show that $\limsup(n_k/m_k) \ge 1$. But, the q.m.d. assumption
precludes $n_k$ being bounded (Problem 13.17), while the above argument
shows the limiting power against $\theta_k$ would be bounded above by $\beta$ if $n_k \to \infty$.
So, it suffices to show $\liminf(n_k/m_k) \le 1$. Fix $\epsilon > 0$ and let $s_k$ satisfy

$$s_k^{1/2}(\theta_k - \theta_0) \to (z_{1-\alpha} - z_{1-\beta}) \cdot \frac{\sigma}{\mu'(\theta_0)} + \epsilon .$$

Note that $s_k/m_k \le 1 + C\epsilon$ for some $C$. Then, the limiting power of $\phi_{s_k}$ against
$\theta_k$ is, by the above argument, strictly greater than $\beta$. Hence, for large enough $k$,
$n_k \le s_k$, and so

$$\liminf \frac{n_k}{m_k} \le \liminf \frac{s_k}{m_k} \le 1 + C\epsilon .$$

Since $\epsilon$ was arbitrary, the result follows. ■

Inspection of (13.26) shows that, the larger the value $\mu'(\theta_0)/\sigma$, the smaller is
the sample size required to achieve a given power $\beta$. A test sequence generated
by $T_n$ will therefore be more efficient the larger its value of $\mu'(\theta_0)/\sigma$. This value
is called the efficacy of the test sequence. Under some regularity conditions, Rao
(1963) proved that

$$[\mu'(\theta_0)/\sigma(\theta_0)]^2 \le I(\theta_0) ,$$

where $I(\theta_0)$ is the usual Fisher Information. Such a result will follow from the
results in Section 13.3 under the assumption of quadratic mean differentiability.


Example 13.2.1 (Wald and Rao Tests) Under the assumptions of Theorem
13.2.1, suppose $\hat\theta_n$ satisfies (12.62), and consider the Wald test that rejects for
large values of $\hat\theta_n - \theta_0$. By Theorem 12.4.1, the assumptions of Theorem 13.2.1
hold with $\mu(\theta) = \theta$ and $\sigma^2 = I^{-1}(\theta_0)$. (The theorem establishes asymptotic
normality under sequences $\theta_n$ of the form $\theta_0 + hn^{-1/2}$, but it holds more generally
for sequences $\theta_n$ satisfying $n^{1/2}(\theta_n - \theta_0) = O(1)$, by Problem 12.32.) Hence, the
squared efficacy of the Wald test is $I(\theta_0)$. The same is true for Rao's score test
(Problem 13.18). ■


Corollary 13.2.1 Assume the conditions of Theorem 13.2.1 hold for $\phi = \{\phi_n\}$
and consider a competing test sequence $\tilde\phi = \{\tilde\phi_n\}$ based on a test statistic $\tilde T_n$
satisfying (13.24) with $\mu$ and $\sigma$ replaced by $\tilde\mu$ and $\tilde\sigma$. Fix $0 < \alpha < \beta < 1$ and for
$\theta > \theta_0$, let $N(\theta)$ and $\tilde N(\theta)$ be the smallest sample sizes necessary for $\phi$ and $\tilde\phi$ to
have power at least $\beta$ against $\theta$. Then,

$$\lim_{\theta \downarrow \theta_0} \frac{\tilde N(\theta)}{N(\theta)} = \left[ \frac{\mu'(\theta_0)/\sigma}{\tilde\mu'(\theta_0)/\tilde\sigma} \right]^2 , \tag{13.28}$$

and the right hand side is called the (Pitman) Asymptotic Relative Efficiency
(ARE) of $\phi$ with respect to $\tilde\phi$.


Proof. Apply (iii) of Theorem 13.2.1. ■

Notice that the ARE is independent of $\alpha$ and $\beta$. Also, the tests are only required
to be asymptotically level $\alpha$, and the critical values may be random. Thus, we
can, for example, compare tests based on an exact critical value, such as one
obtained from the exact sampling distribution of $T_n$ under $\theta_0$, with tests based
on asymptotic normality, possibly combined with an estimate of the asymptotic
variance. Another possibility is to use a critical value obtained from a permutation
distribution, such as the tests studied in Section 5.12. Nevertheless, under the
assumptions stated, the resulting efficacy of a test is unchanged whether a test is
based on an exact critical value or an approximate one. This implies the ARE is
one when comparing two tests based on the same test statistic but with different
critical values, as long as (13.25) is satisfied.

The ARE provides a single number for comparing two tests, independent of $\alpha$
and $\beta$. However, for finite samples, the relative efficiency depends on both $\alpha$ and
$\beta$. Thus, the asymptotic measure may not give a very good picture of the actual
finite-sample situation.

The following lemma facilitates the computation of the efficacy of a test 
sequence. 

Lemma 13.2.1 Assume $X_1, \ldots, X_n$ are i.i.d. according to a family which is
q.m.d. at $\theta_0$ and that the unknown parameter $\theta$ varies in an open subset of $\mathbb{R}$.
Suppose, under $\theta_n = \theta_0 + h/n^{1/2}$, we have

$$n^{1/2} T_n \xrightarrow{d} N(hm, \sigma^2) .$$

Then, the assumptions in Theorem 13.2.1 hold for $T_n$ and the efficacy of $T_n$ is
$m/\sigma$.

Proof. Let $\mu(\theta) = m(\theta - \theta_0)$. The assumptions imply

$$n^{1/2}(T_n - \mu(\theta_n)) \xrightarrow{d} N(0, \sigma^2)$$

under $\theta_n$ whenever $\theta_n$ is of the form $\theta_n = \theta_0 + hn^{-1/2}$. By Problem 12.32, the same
result holds whenever $n^{1/2}(\theta_n - \theta_0) = O(1)$, so that the asymptotic normality
assumption holds for $T_n$ with $\mu'(\theta) = m$. Thus, the efficacy of $T_n$ is $m/\sigma$. ■

Example 13.2.2 (One-sample Tests of Location) Suppose $X_1, \ldots, X_n$ are
i.i.d. according to a location model with density $f(x - \theta)$, where $f$ is assumed
to be symmetric about 0. Assume $f'(x)$ exists for almost all $x$, and the Fisher
Information is positive and finite, so that the family is q.m.d. We would like to
compare competing tests for testing $\theta = 0$ versus $\theta > 0$. Consider the three tests
that reject for large values of the classical $t$-statistic $t_n$, the sign
test statistic $S_n$, and the Wilcoxon signed rank statistic $W_n$, studied in Examples
12.3.9, 12.3.10, and 12.3.11, respectively. Regardless of whether or not $f$ is known,
all three tests can be used to yield tests that are pointwise consistent in level as
long as $f$ is symmetric and has finite variance. Let $\sigma_f^2$ denote the variance of $f$.
Under $\theta_n = h/n^{1/2}$, we have

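$$n^{1/2} t_n \xrightarrow{d} N\left( \frac{h}{\sigma_f}, 1 \right) ,$$

$$n^{1/2} S_n \xrightarrow{d} N\left( h f(0), \frac{1}{4} \right) ,$$

and

$$n^{1/2} W_n \xrightarrow{d} N\left( h \int_{-\infty}^{\infty} f^2(x)\,dx, \frac{1}{12} \right) .$$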
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Thus, the efficacies of f, S, and W are 1/a, 2/(0), and (12) 1 / 2 J f 2 , respectively. 
Therefore (with an obvious change of notation that shows the dependence on /), 

es,t(f) = [2/(0)a/] 2 

and 

e W ,t(f) = 12a 2 [ J f 2 ] 2 . (13.29) 

In particular, when / is the normal density tp, e s,t(v 3 ) = 2/7T « 0.637 and 
eir,t(i/>) = 3/n « 0.955. Thus, under normality, the sign test requires a sam¬ 
ple size that is about 57 percent greater than the f-test to achieve the same 
power. 

On the other hand, the efficiency loss for the Wilcoxon test is less than 5 
percent. When / is not normal, the efficiency of both the sign test and the 
Wilcoxon test with respect to the f-test can be arbitrary large. To see this, modify 
c p by moving small masses out in the tails of the distribution so that cjj becomes 
quite large but /(0) and f f 2 remain about the same. Moreover, the Wilcoxon test 
can never be much less efficient than the f-test, regardless of /; in fact (Problem 
13.21), 

e w ,t(f ) > 0.864 for all / . (13.30) 

Interestingly, when $f$ is the double exponential density, the sign test is the most
efficient of the three. In fact, it will later be seen in Section 13.3 that the sign
test is asymptotically uniformly most powerful for testing the location parameter
in a double exponential location model. ■
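
The efficiency formulas above are easy to check numerically. The following is a minimal sketch, assuming a density that is symmetric about 0 and negligible outside a finite interval; the helper names `simpson` and `pitman_are`, the integration limits, and the grid size are illustrative choices and not part of the text.

```python
import math

def simpson(g, a, b, n=20000):
    """Composite Simpson rule for the integral of g over [a, b]; n must be even."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def pitman_are(f, lim=40.0):
    """Return (e_{S,t}(f), e_{W,t}(f)) for a density f symmetric about 0."""
    var = simpson(lambda x: x * x * f(x), -lim, lim)   # sigma_f^2 (mean is 0 by symmetry)
    int_f2 = simpson(lambda x: f(x) ** 2, -lim, lim)   # integral of f^2
    e_sign = (2.0 * f(0.0)) ** 2 * var                  # [2 f(0) sigma_f]^2
    e_wilcoxon = 12.0 * var * int_f2 ** 2               # formula (13.29)
    return e_sign, e_wilcoxon

def normal(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def double_exp(x):
    return 0.5 * math.exp(-abs(x))

for name, f in [("normal", normal), ("double exponential", double_exp)]:
    e_s, e_w = pitman_are(f)
    print(f"{name}: e_S,t = {e_s:.3f}, e_W,t = {e_w:.3f}")
```

For the normal density the two values should agree with $2/\pi$ and $3/\pi$. For the double exponential density, $f(0) = 1/2$, $\sigma_f^2 = 2$, and $\int f^2 = 1/4$, so the formulas give $e_{S,t} = 2$ and $e_{W,t} = 1.5$, in line with the remark that the sign test is the most efficient of the three in that model.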

Example 13.2.3 (Two-Sample Tests of Shift) Suppose $X_1, \ldots, X_m$ are i.i.d.
with c.d.f. $F$ and, independently, $Y_1, \ldots, Y_n$ are i.i.d. with c.d.f. $G$. Assume

$$G(x) = F(x - \theta) \tag{13.31}$$

for some $\theta$. If $F$ is unknown, such a nonparametric two-sample shift model was
studied in Section 5.8, where the class of permutation tests was introduced.
Consider the problem of testing $\theta = 0$ versus $\theta > 0$. We would like to compare
the normal scores test and the Wilcoxon test $W$ introduced in Section 6.9, as
well as the two-sample t-test and the permutation t-test. It turns out that, even
when $F$ and $G$ are normal with a common variance, the normal scores test and
the Wilcoxon test are nearly as powerful as the t-test. To obtain a numerical
comparison, suppose $m = n$. Then, the notion of relative efficiency applies with
no changes (by viewing the observations as pairs $(X_i, Y_i)$), and so the (Pitman)
asymptotic relative efficiencies can be computed for test statistics satisfying the
assumptions of Theorem 13.2.1.

In the particular case of the Wilcoxon test, $e_{W,t} = 3/\pi$ when $F$ and $G$ are
normal with equal variance. Some numerical evidence supports the fact that the
relative efficiency is nearly independent of $\alpha$ and $\beta$ in this context; see Lehmann
(1998), p. 79. As in the one-sample case, the (Pitman) asymptotic relative
efficiency is always $\ge .864$, but may exceed 1 and can be infinite. The situation is
even more favorable for the normal scores test. Its asymptotic relative efficiency,
relative to the t-test, is always $\ge 1$ under the model (13.31); moreover, it is 1 only
when $F$ is normal. Thus, while the t-test is performance robust in the sense that
its level and power are asymptotically independent of $F$ as discussed in Section
11.3, the present results show that the efficiency and optimality properties of the
t-test are quite nonrobust. The same comments apply to the permutation t-test
(whose asymptotic properties will be discussed in Section 15.2).

The above results do not depend on the assumption of equal sample sizes; they
are also valid if $m/n \asymp 1$. At least in the case that $F$ is normal, the asymptotic
results given by the (Pitman) efficiencies agree well with those found for small
samples. The results also extend to testing the equality of $s$ means, and the
asymptotic relative efficiency of the Kruskal-Wallis test to the normal theory
F-test is the same as that of the Wilcoxon to the t-test in the case $s = 2$. For a more
detailed discussion of these and related efficiency results, see, for example, Lehmann (1998),
Randles and Wolfe (1979), Blair and Higgins (1980), and Groeneboom (1980).

The most ambitious goal in the nonparametric two-sample shift model would be 
to find a test which does not depend on F, yet would have asymptotic efficiency at 
least 1 with respect to any other test, for all F (or at least all F in a nonparametric 
family). Such adaptive tests (which achieve simultaneous optimality by adapting 
themselves to the unknown F) do in fact exist if F is sufficiently smooth. Their 
possibility was first suggested by Stein (1956b), and has been carried out for
point estimation problems by Beran (1974), Stone (1975) and Bickel (1982). ■

We now briefly mention some other notions of asymptotic relative efficiency.
Consider two test sequences $\phi = \{\phi_n\}$ and $\tilde\phi = \{\tilde\phi_n\}$, each indexed by the sample
size $n$. For simplicity, suppose $\phi$ is determined by a test statistic $T = \{T_n\}$ which
rejects for large values. Then, $\phi_n$ is really a family of tests indexed by $n$ and
$\alpha$, where the value $\alpha$ determines the size of the test. Define $N(\alpha, \beta, \theta)$ to be
the sample size necessary for the test $\phi_n$ to have power $\ge \beta$ against the fixed
alternative $\theta$, subject to the constraint that the size of $\phi_n$ is $\alpha$. Thus, $N$ is the
smallest sample size $n$ such that, for some critical value $c = c(n, \alpha)$, we have

$$\sup_{\theta_0 \in \Omega_0} P_{\theta_0}\{T_n > c\} \le \alpha \tag{13.32}$$

and

$$P_\theta\{T_n > c\} \ge \beta.$$

Similarly, define $\tilde N(\alpha, \beta, \theta)$ corresponding to a test $\tilde\phi_n$ based on a test statistic
$\tilde T_n$. Then, the relative efficiency of $\tilde\phi$ with respect to $\phi$ is defined to be

$$e_{\tilde T, T}(\alpha, \beta, \theta) = N(\alpha, \beta, \theta)/\tilde N(\alpha, \beta, \theta).$$
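
To make the definition concrete, the sketch below computes the two sample sizes in one simple setting. It assumes, purely for illustration, observations from $N(\theta, 1)$, takes $T$ to be the one-sided z-test with known variance (a stand-in for the t-test), and takes $\tilde T$ to be the nonrandomized sign test, whose size is at most $\alpha$ as permitted by (13.32). The function names and the chosen values of $\alpha$, $\beta$, and $\theta$ are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal law

def n_ztest(alpha, beta, theta):
    """Smallest n for which the size-alpha z-test has power >= beta at theta."""
    z = STD.inv_cdf(1.0 - alpha)
    n = 1
    while 1.0 - STD.cdf(z - math.sqrt(n) * theta) < beta:
        n += 1
    return n

def upper_tail(n, p, c):
    """P{S > c} when S ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(c + 1, n + 1))

def n_signtest(alpha, beta, theta, n_max=500):
    """Smallest n for which some cutoff c gives size <= alpha and power >= beta."""
    p = STD.cdf(theta)  # P{X > 0} when X ~ N(theta, 1)
    for n in range(1, n_max + 1):
        # The smallest cutoff meeting the size constraint maximizes the power.
        c = next(c for c in range(n + 1) if upper_tail(n, 0.5, c) <= alpha)
        if upper_tail(n, p, c) >= beta:
            return n
    raise ValueError("no sample size up to n_max meets the requirements")

alpha, beta, theta = 0.05, 0.80, 0.5
n_t = n_ztest(alpha, beta, theta)
n_s = n_signtest(alpha, beta, theta)
# Ratio N / N-tilde: finite-sample relative efficiency of the sign test w.r.t. the z-test.
print(n_t, n_s, n_t / n_s)
```

Because the sign statistic is discrete, the ratio fluctuates with $\alpha$, $\beta$, and $\theta$, which illustrates why limiting versions of this quantity are easier to work with.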

While this measure has a useful statistical interpretation, its value depends on
three arguments $\alpha$, $\beta$ and $\theta$; moreover, it is typically quite difficult to compute
$N(\alpha, \beta, \theta)$ for a given test $\phi$. However, it is often possible to calculate the limiting
values of $e_{\tilde T, T}(\alpha, \beta, \theta)$ as $\alpha \to 0$, $\beta \to 1$, or $\theta \to \theta_0 \in \Omega_0$, with the remaining
two arguments kept fixed. The case $\alpha \to 0$ is known as the Bahadur efficiency,
the case $\beta \to 1$ as the Hodges-Lehmann efficiency, and the case $\theta \to \theta_0$ coincides
with the (Pitman) ARE already introduced. These various types of efficiency are
reviewed in Serfling (1980, Chapter 10) and Nikitin (1995, Chapter 1). While
each of these notions of asymptotic relative efficiency has some merit, we argue
that the Pitman ARE has the most practical significance. In practice, $\alpha$, though
small, is regarded as fixed, and so comparisons based on the Bahadur efficiency
with $\alpha \to 0$ may be questionable. On the other hand, with $\alpha$ fixed, comparing
procedures with power tending to 1 seems inappropriate since then the probability
of an error of the second kind becomes smaller than the probability of an
error of the first kind. Typically, for values of the parameter at a fixed distance
from $\Omega_0$, any reasonable test will have power tending to one. It then becomes
more important to choose a test that is better equipped to deal with the more
difficult situation when $\theta$ is near $\Omega_0$, and the Pitman asymptotic relative efficiency
provides a useful measure in this situation. Numerical evidence for the superiority
of Pitman over Bahadur efficiency is provided in Groeneboom and Oosterhoff
(1981).


13.3 AUMP Tests in Univariate Models 

Suppose $X_1, \ldots, X_n$ are i.i.d. $P_\theta$, with $\theta$ real-valued, and consider testing the
hypothesis $\theta = \theta_0$ against $\theta > \theta_0$. As was discussed in Section 3.4, even in this
one-parameter model, UMP tests rarely exist. In the present section we shall
show that under weak smoothness assumptions, asymptotically optimal tests do
exist.

As we saw in Section 13.1, when the q.m.d. assumption holds, informative
power calculations for large samples are obtained not against fixed alternatives
(for which the power tends to 1) but against sequences of alternatives of the form

$$\theta_{n,h} = \theta_0 + h n^{-1/2}, \quad h > 0, \tag{13.33}$$

for which the power tends to a value strictly between $\alpha$ and 1. Asymptotic
optimality is most naturally studied in terms of these alternatives.

Let $\{\alpha_n\}$ be a sequence of levels tending to $\alpha$. By the Neyman-Pearson Lemma,
the most powerful test $\phi_{n,h}$ for testing $\theta = \theta_0$ against $\theta_{n,h}$ at level $\alpha_n$ rejects when

$$L_{n,h} = \prod_{i=1}^{n} \left[ p_{\theta_0 + h n^{-1/2}}(X_i) / p_{\theta_0}(X_i) \right]$$

is sufficiently large; more specifically, it is given by

$$\phi_{n,h} = \begin{cases} 1 & \text{if } \log(L_{n,h}) > c_{n,h} \\ \gamma_{n,h} & \text{if } \log(L_{n,h}) = c_{n,h} \\ 0 & \text{if } \log(L_{n,h}) < c_{n,h}, \end{cases} \tag{13.34}$$

where the constants $c_{n,h}$ and $\gamma_{n,h}$ are determined so that $E_{\theta_0}(\phi_{n,h}) = \alpha_n$.

The limits of the critical values $c_{n,h}$ and the power of the tests (13.34) against
the alternatives (13.33) are given in the following lemma, under the assumption
of quadratic mean differentiability.


Lemma 13.3.1 Assume $\{P_\theta, \theta \in \Omega\}$ is q.m.d. at $\theta_0$ with $\Omega$ an open subset of
$\mathbb{R}$. Consider testing $\theta = \theta_0$ against $\theta_{n,h} = \theta_0 + h n^{-1/2}$ at level $\alpha_n \to \alpha \in (0,1)$.

(i) As $n \to \infty$, the critical values $c_{n,h}$ of the most powerful test sequence $\phi_{n,h}$
defined in (13.34) satisfy

$$c_{n,h} \to -\frac{h^2 I(\theta_0)}{2} + h I^{1/2}(\theta_0) z_{1-\alpha}, \tag{13.35}$$

where $I(\theta_0)$ is the Fisher Information at $\theta_0$ and $z_{1-\alpha} = \Phi^{-1}(1 - \alpha)$ is the $1 - \alpha$
quantile of $N(0,1)$. Moreover,

$$P_{\theta_0}\{\log(L_{n,h}) > c_{n,h}\} \to \alpha \tag{13.36}$$

and

$$P_{\theta_0}\{\log(L_{n,h}) = c_{n,h}\} \to 0. \tag{13.37}$$

(ii) The power of $\phi_{n,h}$ satisfies

$$E_{\theta_0 + h n^{-1/2}}(\phi_{n,h}) \to 1 - \Phi[z_{1-\alpha} - h I^{1/2}(\theta_0)]. \tag{13.38}$$

(iii) More generally, consider testing $\theta = \theta_0$ against $\theta_{n,h_n}$ where $h_n \to h$, with
$|h| < \infty$. Then, the power of $\phi_{n,h_n}$ against $\theta_{n,h_n}$ converges to the right side of
(13.38), i.e., it has the same limiting power as $\phi_{n,h}$.

Proof. By Theorem 12.2.3, under $\theta_0$, $\log(L_{n,h})$ converges weakly to $N(-\sigma_h^2/2, \sigma_h^2)$,
where $\sigma_h^2 = h^2 I(\theta_0)$. Then, (13.37) follows by Problem 11.42(i). Hence,

$$\alpha_n = E_{\theta_0}(\phi_{n,h}) = P_{\theta_0}\{\log(L_{n,h}) > c_{n,h}\} + o(1),$$

and so (13.36) follows. By Problem 11.42(ii), it follows that $c_{n,h}$ tends to the
$1 - \alpha$ quantile of $N(-\sigma_h^2/2, \sigma_h^2)$, and so (13.35) follows.

To prove (ii), under $\theta_{n,h}$, $\log(L_{n,h})$ converges in distribution to a variable
$Y_h$ distributed as $N(\sigma_h^2/2, \sigma_h^2)$, as shown in Example 12.3.12 by a contiguity
argument. Hence, under $\theta_0 + h n^{-1/2}$, the probability that $\log(L_{n,h}) = c_{n,h}$ tends
to 0, again by Problem 11.42(i). Letting $Z$ denote a standard normal variable,

$$E_{\theta_{n,h}}(\phi_{n,h}) = P_{\theta_{n,h}}\{\log(L_{n,h}) > c_{n,h}\} + o(1)$$

$$\to P\left\{Y_h > -\frac{\sigma_h^2}{2} + \sigma_h z_{1-\alpha}\right\} = P\{Z > -\sigma_h + z_{1-\alpha}\} = 1 - \Phi(z_{1-\alpha} - h I^{1/2}(\theta_0)),$$

and (ii) follows.

The proof of (iii) is left to Problem 13.27. ■
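
The limits (13.35) and (13.38) can be checked against each other using the two limiting normal laws that appear in the proof. The sketch below is only a consistency check under the assumption that $\log(L_{n,h})$ is exactly $N(-\sigma_h^2/2, \sigma_h^2)$ under the null and $N(\sigma_h^2/2, \sigma_h^2)$ under the alternative, as happens in a normal location model; the function names and the numerical values of $h$, $I(\theta_0)$, and $\alpha$ are illustrative.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal law

def limiting_critical_value(h, info, alpha):
    """Right side of (13.35): -h^2 I / 2 + h I^{1/2} z_{1-alpha}."""
    sigma_h = h * info ** 0.5
    return -sigma_h ** 2 / 2.0 + sigma_h * STD.inv_cdf(1.0 - alpha)

def limiting_power(h, info, alpha):
    """Right side of (13.38): 1 - Phi(z_{1-alpha} - h I^{1/2})."""
    return 1.0 - STD.cdf(STD.inv_cdf(1.0 - alpha) - h * info ** 0.5)

h, info, alpha = 1.5, 0.25, 0.05          # illustrative values
sigma_h = h * info ** 0.5
c = limiting_critical_value(h, info, alpha)

null_law = NormalDist(-sigma_h ** 2 / 2.0, sigma_h)   # limit law of log L under theta_0
alt_law = NormalDist(sigma_h ** 2 / 2.0, sigma_h)     # limit law under theta_{n,h}

size = 1.0 - null_law.cdf(c)     # should equal alpha
power = 1.0 - alt_law.cdf(c)     # should equal limiting_power(h, info, alpha)
print(size, power, limiting_power(h, info, alpha))
```

The cutoff from (13.35) has probability $\alpha$ of being exceeded under the null limit law, and the probability of exceeding it under the alternative limit law reproduces (13.38).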

Next, we consider the notion of an asymptotically most powerful test sequence
for testing a simple hypothesis $\theta = \theta_0$ against a simple alternative sequence $\theta_n$.

Definition 13.3.1 For testing $\theta = \theta_0$ against $\theta = \theta_n$, $\{\phi_n\}$ is asymptotically
most powerful (AMP) at (asymptotic) level $\alpha$ if $\limsup_n E_{\theta_0}(\phi_n) \le \alpha$ and if for
any other sequence of test functions $\{\psi_n\}$ satisfying $\limsup_n E_{\theta_0}(\psi_n) \le \alpha$,

$$\limsup_n \, [E_{\theta_n}(\psi_n) - E_{\theta_n}(\phi_n)] \le 0. \tag{13.39}$$

For q.m.d. families, Lemma 13.3.1 implies the following result (Problem 13.28). 

Theorem 13.3.1 Assume $\{P_\theta, \theta \in \Omega\}$ is q.m.d. at $\theta_0$ with $\Omega$ an open subset of
$\mathbb{R}$ and Fisher Information $I(\theta_0)$. Given $X_1, \ldots, X_n$ i.i.d. $P_\theta$, consider testing
$\theta = \theta_0$ against $\theta_n = \theta_0 + h_n n^{-1/2}$, where $h_n \to h > 0$. Then, $\phi_n = \phi_n(X_1, \ldots, X_n)$
is AMP level $\alpha$ if and only if $E_{\theta_0}(\phi_n) \to \alpha$ and

$$\limsup_n E_{\theta_0 + h n^{-1/2}}(\phi_n) = [1 - \Phi(z_{1-\alpha} - h I^{1/2}(\theta_0))]. \tag{13.40}$$

Of course, for testing a simple null hypothesis against a simple alternative, one
always has available the optimal finite sample Neyman-Pearson test sequence $\phi_{n,h}$
given by (13.34). However, the tests $\phi_{n,h}$ will typically depend on $h$ and therefore
will not be uniformly best against all alternatives. At this point, though, there is
a profound difference between the finite sample and the asymptotic theory. Most
powerful tests typically are unique while this is not true for asymptotically most
powerful tests, since they can be changed on sets whose probability tends to zero
without changing the asymptotic power. This difference opens up the possibility
that among the set of AMP tests there may be one that is AMP simultaneously
for all values of $h$. This possibility will be explored in the remainder of this
section.

For this purpose, recall the expansion of $\log(L_{n,h})$. By Theorem 12.2.3,

$$\log(L_{n,h}) - \left[h Z_n - \tfrac{1}{2} h^2 I(\theta_0)\right] = o_{P^n_{\theta_0}}(1), \tag{13.41}$$

where $\tilde\eta(x, \theta) = 2\eta(x, \theta)/p_\theta^{1/2}(x)$, $\eta(\cdot, \theta)$ is the quadratic mean derivative at $\theta$,
and $Z_n$ is the score statistic given by

$$Z_n = n^{-1/2} \sum_{i=1}^{n} \tilde\eta(X_i, \theta_0). \tag{13.42}$$

By Problem 12.24, the left hand side of (13.41) tends in probability to 0 not only
under the null hypothesis but also under the alternative sequence $P^n_{\theta_0 + h n^{-1/2}}$.
Hence, the test that rejects for large values of $\log(L_{n,h})$ should behave
approximately like the test that rejects for large values of $h Z_n - \tfrac{1}{2} h^2 I(\theta_0)$. But,
this latter test is equivalent to rejecting for large values of $Z_n$, regardless of the
value of $h$.

Consider therefore Rao's score test $\tilde\phi_n$ given by

$$\tilde\phi_n = \begin{cases} 1 & \text{if } Z_n > I^{1/2}(\theta_0) z_{1-\alpha} \\ 0 & \text{otherwise.} \end{cases} \tag{13.43}$$

As discussed in Section 12.4.3, $\tilde\phi_n$ maximizes the derivative of the power function
at $\theta_0$, and we will soon see that the limiting power of $\tilde\phi_n$ against alternatives of
the form $\theta_0 + h n^{-1/2}$ is the optimal value given by the right side of (13.38).

We now derive the asymptotic properties of $\tilde\phi_n$. Although we could argue
by comparing $\tilde\phi_n$ with $\phi_{n,h}$, we proceed instead with a direct calculation. First
observe that, under $\theta_0$, $E_{\theta_0}(\tilde\phi_n) \to \alpha$. To see why, note that, under $\theta_0$,
$Z_n \xrightarrow{d} N(0, I(\theta_0))$, by Theorem 12.2.3. The asymptotic consistency in level follows by
Slutsky's Theorem.

Next, we calculate the limiting power of $\tilde\phi_n$ against an alternative sequence
$\theta_{n,h_n}$ with $h_n \to h < \infty$. By Corollary 12.4.1, under the alternative sequence
$\theta_0 + h_n n^{-1/2}$,

$$Z_n \xrightarrow{d} N(h I(\theta_0), I(\theta_0)). \tag{13.44}$$

Therefore,

$$E_{\theta_0 + h_n n^{-1/2}}(\tilde\phi_n) = P_{\theta_0 + h_n n^{-1/2}}\{Z_n > I^{1/2}(\theta_0) z_{1-\alpha}\}$$

$$= P_{\theta_0 + h_n n^{-1/2}}\left\{ \frac{Z_n - h I(\theta_0)}{I^{1/2}(\theta_0)} > \frac{I^{1/2}(\theta_0) z_{1-\alpha} - h I(\theta_0)}{I^{1/2}(\theta_0)} \right\}$$

$$\to P\{Z > z_{1-\alpha} - h I^{1/2}(\theta_0)\} = 1 - \Phi[z_{1-\alpha} - h I^{1/2}(\theta_0)]. \tag{13.45}$$

Thus, $\tilde\phi_n$ has the same limiting power against $\theta_{n,h_n}$ as $\phi_{n,h_n}$. Moreover, the
convergence to the limiting power is uniform over $h$ in $[0, c]$ for any $c < \infty$; that
is,

$$\sup_{0 \le h \le c} \left| E_{\theta_0 + h n^{-1/2}}(\tilde\phi_n) - \{1 - \Phi[z_{1-\alpha} - h I^{1/2}(\theta_0)]\} \right| \to 0 \tag{13.46}$$

as $n \to \infty$. For if not, there would exist a sequence $h_n \in [0, c]$ for which

$$E_{\theta_0 + h_n n^{-1/2}}(\tilde\phi_n) - \{1 - \Phi[z_{1-\alpha} - h_n I^{1/2}(\theta_0)]\} \tag{13.47}$$

does not converge to 0. Then, there exists a subsequence $h_{n_j}$ for which (13.47)
converges along this subsequence to $\delta \ne 0$. Take a further subsequence $h_{n_{j_k}}$ which
converges to a limit, say $h$. But by (13.45), along every subsequence $h_{n_{j_k}}$ which
converges to $h$, we have

$$E_{\theta_0 + h_{n_{j_k}} n_{j_k}^{-1/2}}(\tilde\phi_{n_{j_k}}) \to 1 - \Phi[z_{1-\alpha} - h I^{1/2}(\theta_0)],$$

which yields a contradiction. In summary, we have proved the following.


Lemma 13.3.2 Under the assumptions of Lemma 13.3.1, let $\tilde\phi_n$ be the test
(13.43). Then, $\tilde\phi_n$ is asymptotically level $\alpha$ and its power against
$\theta_0 + h n^{-1/2}$ converges to the optimal limiting power uniformly in $h \in [0, c]$ for
any $c > 0$; specifically, (13.46) holds.

Lemma 13.3.2 asserts an optimality property for $\tilde\phi_n$. This notion of optimality
is appropriate for q.m.d. families since the optimal limiting power against
sequences of the form $\theta_0 + h n^{-1/2}$ is nondegenerate, i.e., strictly between $\alpha$ and
1. Even for q.m.d. families, the conclusion of Lemma 13.3.2 does not imply uniform
optimality against all alternative sequences with $h$ allowed to range over all of
$\mathbb{R}$. We would now like to define a general notion of asymptotically uniformly
most powerful for a test sequence $\phi_n$ satisfying $\limsup E_{\theta_0}(\phi_n) \le \alpha$. A natural
definition might be to require that, for any other test sequence $\psi_n$ satisfying
$\limsup E_{\theta_0}(\psi_n) \le \alpha$, we have

$$\limsup_n \, [E_\theta(\psi_n) - E_\theta(\phi_n)] \le 0$$

for all $\theta$. This definition does not work because most tests are consistent, i.e., for
any fixed $\theta$, both $E_\theta(\phi_n)$ and $E_\theta(\psi_n)$ tend to one, and hence the difference will
tend to zero. To avoid this difficulty, we will require $\phi_n$ to behave well uniformly
across $\theta$, which implies that $\phi_n$ must behave well against local alternatives $\theta_n$
converging to $\theta_0$ at an appropriate rate. Of course, under the q.m.d. assumption,
it was seen in Section 13.1 and in Lemma 13.3.1 that the nondegenerate rate
corresponds to $\theta_n - \theta_0 \asymp n^{-1/2}$.

Following Wald (1941a, 1943) and Roussas (1972), we therefore define an 
asymptotically uniformly most powerful (AUMP) test sequence. 


Definition 13.3.2 For testing $\theta = \theta_0$ against $\theta > \theta_0$, a sequence of tests $\{\phi_n\}$
is called asymptotically uniformly most powerful (AUMP) at (asymptotic) level
$\alpha$ if $\limsup_n E_{\theta_0}(\phi_n) \le \alpha$ and if for any other sequence of test functions $\{\psi_n\}$
satisfying $\limsup_n E_{\theta_0}(\psi_n) \le \alpha$,

$$\limsup_n \, \sup\{E_\theta(\psi_n) - E_\theta(\phi_n) : \theta > \theta_0\} \le 0. \tag{13.48}$$

Equivalently, $\phi_n$ is AUMP level $\alpha$ if $\limsup_n E_{\theta_0}(\phi_n) \le \alpha$ and $\phi_n$ is AMP
against any sequence of alternatives $\{\theta_n\}$ with $\theta_n > \theta_0$ (Problem 13.29). Note that
this definition is not restricted to q.m.d. families; it also easily generalizes further
to problems with nuisance parameters; see (13.71). Also, note that the definition
differs slightly from those of Wald and Roussas in that we allow tests that are
not exactly level $\alpha$ for finite $n$, as long as the lim sup of the size is bounded above
by $\alpha$. Of course, we will typically consider tests meeting the stronger requirement
$E_{\theta_0}(\phi_n) \to \alpha$, but we prefer not to rule out a priori tests that do not satisfy this
convergence.

A slightly weaker notion than Definition 13.3.2 is the following. 

Definition 13.3.3 For testing $\theta = \theta_0$ against $\theta > \theta_0$, a sequence of tests $\{\phi_n\}$
is called locally asymptotically uniformly most powerful (LAUMP) at level $\alpha$ if
$\limsup_n E_{\theta_0}(\phi_n) \le \alpha$ and for any other sequence of test functions $\{\psi_n\}$ satisfying
$\limsup_n E_{\theta_0}(\psi_n) \le \alpha$,

$$\limsup_n \, \sup\{E_\theta(\psi_n) - E_\theta(\phi_n) : 0 < n^{1/2}(\theta - \theta_0) \le c\} \le 0 \tag{13.49}$$

for any $c > 0$.

In (13.48), the sup over $\{\theta : \theta > \theta_0\}$ can be reparametrized as the sup over
$\{h : \theta_0 + h n^{-1/2} > \theta_0\}$. Hence, condition (13.48) can be rewritten as

$$\limsup_n \, \sup\{E_{\theta_0 + h n^{-1/2}}(\psi_n) - E_{\theta_0 + h n^{-1/2}}(\phi_n) : h > 0\} \le 0$$

and (13.49) can be rewritten as this same expression with the sup over $h > 0$
replaced by the sup over $\{0 < h \le c\}$. In view of Lemma 13.3.1, under q.m.d.,
we can express the conditions for a test sequence $\phi_n$ to be AUMP or LAUMP in
terms of the limiting values of its power against local alternatives.

Theorem 13.3.2 Consider testing $\theta = \theta_0$ against $\theta > \theta_0$ in a q.m.d. family
with nonzero Fisher Information $I(\theta_0)$. If $\phi_n = \phi_n(X_1, \ldots, X_n)$ is any sequence
of tests based on $n$ i.i.d. observations such that $E_{\theta_0}(\phi_n) \to \alpha$, then

$$\limsup_n E_{\theta_0 + h n^{-1/2}}(\phi_n) \le [1 - \Phi(z_{1-\alpha} - h I^{1/2}(\theta_0))]. \tag{13.50}$$

Moreover, $\phi_n$ is AUMP at level $\alpha$ if and only if

$$\sup_{h > 0} \left| E_{\theta_0 + h n^{-1/2}}(\phi_n) - [1 - \Phi(z_{1-\alpha} - h I^{1/2}(\theta_0))] \right| \to 0 \tag{13.51}$$

and $\phi_n$ is LAUMP if and only if, for every $c > 0$,

$$\sup_{c \ge h > 0} \left| E_{\theta_0 + h n^{-1/2}}(\phi_n) - [1 - \Phi(z_{1-\alpha} - h I^{1/2}(\theta_0))] \right| \to 0. \tag{13.52}$$

Lemma 13.3.2 asserts that $\tilde\phi_n$ defined by (13.43) is not only AMP, but LAUMP.
We now obtain necessary and sufficient conditions for a test to be LAUMP, as
well as a sufficient condition for a test to be AUMP. The results are summarized
as follows.

Theorem 13.3.3 Consider testing $\theta = \theta_0$ against $\theta > \theta_0$ in a q.m.d. family with
nonzero Fisher Information $I(\theta_0)$. Let $\tilde\phi_n$ be the test defined by (13.43).

(i) Then, $\tilde\phi_n$ satisfies (13.52) and so is LAUMP at level $\alpha$.

(ii) Any test sequence $\phi_n$ satisfying, under $\theta_0$,

$$\phi_n - \tilde\phi_n \xrightarrow{P} 0 \tag{13.53}$$

is also LAUMP at level $\alpha$.

(iii) For $\phi_n$ to be LAUMP at level $\alpha$, the condition (13.53) is also necessary.

(iv) If, in addition, $Z_n \to \infty$ in $P_{\theta_n}$-probability whenever $n^{1/2}(\theta_n - \theta_0) \to \infty$,
then $\tilde\phi_n$ is also AUMP at level $\alpha$.

Proof. The proof of (i) follows from Lemma 13.3.2 and Theorem 13.3.2. To prove
(ii), the condition (13.53) ensures the limiting size requirement. By contiguity,
under $\theta_{n,h_n}$, $\phi_n - \tilde\phi_n \to 0$ in probability whenever $h_n \le c$. It follows that

$$E_{\theta_0 + h_n n^{-1/2}}(\phi_n) - E_{\theta_0 + h_n n^{-1/2}}(\tilde\phi_n) \to 0$$

whenever $h_n \le c$, which implies

$$\sup_{0 \le h \le c} \left| E_{\theta_0 + h n^{-1/2}}(\phi_n) - E_{\theta_0 + h n^{-1/2}}(\tilde\phi_n) \right| \to 0,$$

and (ii) follows.

To prove (iii), fix $h > 0$ and consider the sequence of alternatives $\theta_{n,h}$. Let $\phi_n$
be the indicator of the event

$$L_{n,h} > k \equiv \exp\left(-\frac{\sigma_h^2}{2} + \sigma_h z_{1-\alpha}\right),$$

where $\sigma_h^2 = h^2 I(\theta_0)$. Then, $\phi_n$ is LAUMP level $\alpha$ by (ii) (from the asymptotic
normality of $\log(L_{n,h})$). Suppose $\phi_n^*$ is also LAUMP level $\alpha$. By Problem 13.30,
$E_{\theta_0}(\phi_n^*) \to \alpha$. Then, letting $p_\theta^n$ denote the joint density under $\theta$ and letting $\mu_n$
denote a measure dominating $p^n_{\theta_0}$ and $p^n_{\theta_{n,h}}$,

$$\int (\phi_n - \phi_n^*)(p^n_{\theta_{n,h}} - k p^n_{\theta_0})\,d\mu_n \to 0.$$

But, the integrand in the above equation is always nonnegative. Hence, the
integral over the set $\{p^n_{\theta_0} > 0\}$ also tends to 0, so that

$$\int (\phi_n - \phi_n^*)(L_{n,h} - k)\,p^n_{\theta_0}\,d\mu_n \to 0.$$

Since the integrand is nonnegative, it follows (by Markov's inequality) that for
every $\eta > 0$, under $\theta_0$,

$$P_{\theta_0}\{|\phi_n - \phi_n^*| \cdot |L_{n,h} - k| > \eta\} \to 0. \tag{13.54}$$

We want to conclude that, for any $\epsilon > 0$,

$$P_{\theta_0}\{|\phi_n - \phi_n^*| > \epsilon\} \to 0.$$

But, for any $\delta > 0$,

$$P_{\theta_0}\{|\phi_n - \phi_n^*| > \epsilon\} = P_{\theta_0}\{|\phi_n - \phi_n^*| > \epsilon, \ |L_{n,h} - k| \ge \delta\}$$

$$+ \, P_{\theta_0}\{|\phi_n - \phi_n^*| > \epsilon, \ |L_{n,h} - k| < \delta\}. \tag{13.55}$$

The last term is bounded above by $P_{\theta_0}\{|L_{n,h} - k| < \delta\}$, which, as $n \to \infty$, tends
to a limit $c(\delta)$; moreover, $c(\delta) \to 0$ as $\delta \to 0$
since $L_{n,h}$ has a continuous limiting distribution under $\theta_0$. Thus, the last term
in (13.55) can be made arbitrarily small if $\delta$ is chosen small enough, whereas the
first term is bounded above by

$$P_{\theta_0}\{|\phi_n - \phi_n^*| \cdot |L_{n,h} - k| > \epsilon\delta\} \to 0$$

by (13.54) with $\eta = \epsilon\delta$, and the result follows.

To prove (iv), if the result were false, there would exist a sequence $\theta_n$ such
that $n^{1/2}(\theta_n - \theta_0) \to \infty$ and $E_{\theta_n}(\tilde\phi_n)$ does not converge to one. But,

$$E_{\theta_n}(\tilde\phi_n) = P_{\theta_n}\{Z_n > I^{1/2}(\theta_0) z_{1-\alpha}\} \to 1$$

by the added assumption. ■


Example 13.3.1 (Location Models) Suppose $P_\theta$ has density with respect to
Lebesgue measure on the real line given by $f(x - \theta)$, for some fixed $f$. Assume
the conditions of Corollary 12.2.1 to ensure the family is q.m.d., so that $f'$ exists
almost everywhere (with respect to Lebesgue measure),

$$I = \int_{-\infty}^{\infty} \frac{[f'(x)]^2}{f(x)}\,dx$$

is finite and positive, and the quadratic mean derivative is

$$\eta(x, \theta) = -\frac{1}{2} \frac{f'(x - \theta)}{f^{1/2}(x - \theta)}.$$

Then, the score statistic reduces to

$$Z_n = -n^{-1/2} \sum_{i=1}^{n} \frac{f'(X_i - \theta_0)}{f(X_i - \theta_0)}.$$

The test (13.43) is LAUMP level $\alpha$. It is also AUMP level $\alpha$ if $f$ is strongly
unimodal (Problem 13.36); in this case, Example 1 of Section 8.2 shows that the
test is also UMP if $n = 1$. ■


Example 13.3.2 (Double Exponential Location Family) As a special case
of the previous example, let $f(x) = \frac{1}{2}\exp(-|x|)$. Then, $I(\theta) = 1$. Without loss of
generality, consider $\theta_0 = 0$. Then,

$$Z_n = n^{-1/2} \sum_{i=1}^{n} \mathrm{sign}(X_i),$$

where we take $\mathrm{sign}(x) = 1$ if $x > 0$ and $\mathrm{sign}(x) = -1$ otherwise. The resulting
test which rejects when $Z_n > z_{1-\alpha}$ is LAUMP at level $\alpha$. Moreover, this test is
AUMP at level $\alpha$ as well. Although this follows from the previous example (since
$f$ is strongly unimodal), we give a direct proof. Note that

$$\mathrm{Var}_\theta(Z_n) = \mathrm{Var}_\theta[\mathrm{sign}(X_1)] \le E_\theta\{[\mathrm{sign}(X_1)]^2\} = 1.$$

Hence, to show $Z_n \to \infty$ in $P_{\theta_n}$-probability if $n^{1/2}\theta_n \to \infty$, it is enough to
show that $E_{\theta_n}(Z_n) \to \infty$ (by Chebyshev's inequality and the previous bound for
$\mathrm{Var}_\theta(Z_n)$; see Problem 13.31). Letting $F$ denote the c.d.f. with density $f$, we
have

$$E_{\theta_n}(Z_n) = 2 n^{1/2}[F(\theta_n) - F(0)] = n^{1/2}[1 - \exp(-\theta_n)] \to \infty,$$

and the result follows.

In the double exponential location model, an MLE is a sample median $\hat\theta_n$.
The test that rejects the null hypothesis if $n^{1/2}\hat\theta_n > z_{1-\alpha}$ is also AUMP and is
asymptotically equivalent to the test based on $Z_n$ in the sense that the probability
that both tests lead to the same conclusion tends to 1, both under the null
hypothesis and against a sequence of contiguous alternatives (Problem 13.32). ■
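
For the sign test of this example the finite-sample power can be computed exactly, since the number of positive observations is binomial. The sketch below compares that exact power at $\theta = h n^{-1/2}$ with the limiting value $1 - \Phi(z_{1-\alpha} - h)$ from (13.38) with $I = 1$. The function names and the values of $n$, $h$, and $\alpha$ are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal law

def binom_upper_tail(n, p, k_min):
    """P{K >= k_min} for K ~ Binomial(n, p), summed in log space to avoid underflow."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for k in range(max(k_min, 0), n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * log_p + (n - k) * log_q)
        total += math.exp(log_term)
    return total

def sign_test_power(n, h, alpha):
    """Exact power at theta = h / sqrt(n) of the test rejecting when Z_n > z_{1-alpha}."""
    theta = h / math.sqrt(n)
    p = 1.0 - 0.5 * math.exp(-theta)   # P_theta{X_i > 0} for theta >= 0
    z = STD.inv_cdf(1.0 - alpha)
    # With K = #{i : X_i > 0}, Z_n = (2K - n) / sqrt(n), so Z_n > z iff K > (n + z sqrt(n)) / 2.
    k_min = math.floor((n + z * math.sqrt(n)) / 2.0) + 1
    return binom_upper_tail(n, p, k_min)

def limiting_power(h, alpha):
    """1 - Phi(z_{1-alpha} - h), the optimal limiting power when I = 1."""
    return 1.0 - STD.cdf(STD.inv_cdf(1.0 - alpha) - h)

for n in (100, 1000, 10000):
    print(n, sign_test_power(n, 1.0, 0.05), limiting_power(1.0, 0.05))
```

Because the statistic is discrete, the exact size and power oscillate slightly with $n$, but both settle near their limiting values as $n$ grows.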

The following example shows that, without strong unimodality, a LAUMP test 
need not be AUMP in the location model of Example 13.3.1. 


Example 13.3.3 (Cauchy Location Model) Here, $f(x) = [\pi(1 + x^2)]^{-1}$ and
$f'(x) = -2x\pi^{-1}(1 + x^2)^{-2}$. Let $\theta_0 = 0$. Then,

$$Z_n = 2 n^{-1/2} \sum_{i=1}^{n} \frac{X_i}{1 + X_i^2}.$$

By Theorem 13.3.3, since $I(\theta) = 1/2$, the Rao score test that rejects when $Z_n$
exceeds $z_{1-\alpha}/\sqrt{2}$ is LAUMP at level $\alpha$. However, this test is not AUMP at level
$\alpha$. To see why, first note that, for any large $B > 0$, $P_\theta\{X_i > B\} \to 1$ as $\theta \to \infty$,
and so, with $n$ fixed,

$$P_\theta\{\min(X_1, \ldots, X_n) > B\} \to 1$$

as $\theta \to \infty$. Since $x/(1 + x^2)$ is decreasing in $x$ on the set $\{x > 1\}$, this implies
that, for any $z > 0$,

$$P_\theta\{Z_n > z\} \to 0 \quad \text{as } \theta \to \infty \tag{13.56}$$

and thus, for any $c > 0$,

$$\lim_{n \to \infty} \inf_{n^{1/2}\theta > c} P_\theta\{Z_n > z\} = 0. \tag{13.57}$$

But, even the worst case power cannot be below $\alpha$ for an AUMP test.

Thus, the score test based on $Z_n$ cannot be AUMP. Next, compare the test
based on $Z_n$ with the test that rejects for large values of $\tilde X_n$, the sample median.
By Theorem 11.2.8, under $P_\theta$,

$$n^{1/2}(\tilde X_n - \theta) \xrightarrow{d} N\left(0, \frac{\pi^2}{4}\right).$$

Furthermore, since $\tilde X_n$ is location equivariant, the distribution of $n^{1/2}(\tilde X_n - \theta)$
under $\theta$ does not depend on $\theta$. Consider the asymptotically level $\alpha$ test that
rejects when $n^{1/2}\tilde X_n > \frac{\pi}{2} z_{1-\alpha}$. We have

$$\inf_{n^{1/2}\theta \ge c} P_\theta\left\{n^{1/2}\tilde X_n > \frac{\pi}{2} z_{1-\alpha}\right\} = \inf_{n^{1/2}\theta \ge c} P_\theta\left\{n^{1/2}(\tilde X_n - \theta) > \frac{\pi}{2} z_{1-\alpha} - n^{1/2}\theta\right\}$$

$$= \inf_{n^{1/2}\theta \ge c} P_0\left\{n^{1/2}\tilde X_n > \frac{\pi}{2} z_{1-\alpha} - n^{1/2}\theta\right\} = P_0\left\{n^{1/2}\tilde X_n > \frac{\pi}{2} z_{1-\alpha} - c\right\},$$

which, as $n \to \infty$, tends to

$$1 - \Phi\left(z_{1-\alpha} - \frac{2c}{\pi}\right) > \alpha > 0.$$

Note, however, the test based on $\tilde X_n$ is neither LAUMP nor AUMP, though its
power tends to one uniformly over $\{\theta : \theta \ge \delta\}$ for any $\delta > 0$.

However, AUMP tests do exist in the present situation. One such test is the
Wald test based on an efficient likelihood estimator. Actually, all that is required
is a location equivariant estimator $\hat\theta_n$ which satisfies

$$n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, I^{-1}(\theta)), \tag{13.58}$$

where in this case $I^{-1}(\theta) = 2$. Indeed, the above argument with $\hat\theta_n$ replacing $\tilde X_n$
applies with the asymptotic variance $\pi^2/4$ of $\tilde X_n$ replaced by 2.

As mentioned in Section 12.4.1, a difficulty in constructing an efficient
likelihood estimator is due to the fact that the likelihood equation may have multiple
roots. In order to deal with this situation, let $\ell_n(\theta) = \log(L_n(\theta))$. Define

$$\hat\theta_n = \tilde X_n + \frac{\ell_n'(\tilde X_n)}{n I(\tilde X_n)}. \tag{13.59}$$

The construction is based on the fact that the nearest root to a consistent
estimator is efficient (under regularity conditions which hold for this model). Instead
of determining the closest root exactly, which involves solving $\ell_n'(\theta) = 0$, a linear
approximation to $\ell_n'(\theta)$ (expanded about $\tilde X_n$) is used; see Section 6.4 of
Lehmann and Casella (1998). By Corollary 4.4 in Section 6.4 of Lehmann and
Casella (1998), $\hat\theta_n$ satisfies (13.58). The test that rejects when $n^{1/2}\hat\theta_n > 2^{1/2} z_{1-\alpha}$
therefore is AUMP (Problem 13.33). ■
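
The contrast among the three tests of this example can be seen in a small simulation. The sketch below is a Monte Carlo illustration only: the sample size, number of replications, seed, and the three values of $\theta$ are arbitrary choices, and the function names are hypothetical. It uses $\ell_n'(\theta) = \sum_i 2(X_i - \theta)/[1 + (X_i - \theta)^2]$ and $I = 1/2$ for the Cauchy model.

```python
import math
import random
from statistics import NormalDist, median

Z95 = NormalDist().inv_cdf(0.95)   # z_{1-alpha} with alpha = 0.05

def cauchy_sample(rng, n, theta):
    """n i.i.d. draws from the Cauchy location model centered at theta."""
    return [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def score_rejects(x):
    """Rao score test: Z_n > z_{1-alpha} / sqrt(2), since I = 1/2."""
    z_n = 2.0 / math.sqrt(len(x)) * sum(xi / (1.0 + xi * xi) for xi in x)
    return z_n > Z95 / math.sqrt(2.0)

def median_rejects(x):
    """Median test: sqrt(n) * median > (pi / 2) z_{1-alpha}."""
    return math.sqrt(len(x)) * median(x) > 0.5 * math.pi * Z95

def one_step_rejects(x):
    """Wald test based on the one-step estimator (13.59)."""
    n, m = len(x), median(x)
    score_at_m = sum(2.0 * (xi - m) / (1.0 + (xi - m) ** 2) for xi in x)  # l_n'(m)
    theta_hat = m + score_at_m / (n * 0.5)                                 # n I = n / 2
    return math.sqrt(n) * theta_hat > math.sqrt(2.0) * Z95

def rejection_rates(theta, n=50, reps=2000, seed=0):
    """Monte Carlo rejection frequencies of (score, median, one-step Wald) tests."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(reps):
        x = cauchy_sample(rng, n, theta)
        for j, test in enumerate((score_rejects, median_rejects, one_step_rejects)):
            counts[j] += test(x)
    return [c / reps for c in counts]

for theta in (0.0, 0.5, 50.0):
    print(theta, rejection_rates(theta))
```

At $\theta = 0$ all three rejection frequencies should be near the nominal level. For a distant alternative such as $\theta = 50$ the score test almost never rejects, in line with (13.56), while the median and one-step Wald tests reject essentially always.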

Example 13.3.4 (Wald Tests) As Example 13.3.3 shows, an AUMP test can
be based on an efficient estimator, resulting in the Wald tests introduced in
Subsection 12.4.2. Actually, this holds more generally. Assume the conditions of
Theorem 13.3.3. Suppose $\hat\theta_n$ satisfies (12.62). For testing $\theta = \theta_0$ versus $\theta > \theta_0$,
the test $\phi_n$ that rejects when $n^{1/2}(\hat\theta_n - \theta_0) > z_{1-\alpha} I^{-1/2}(\theta_0)$ is LAUMP level $\alpha$.
Indeed, the expansion (12.62) implies that $\phi_n - \tilde\phi_n \to 0$ in probability under $\theta_0$,
where $\tilde\phi_n$ is the score test (13.43), so that $\phi_n$ is LAUMP by (ii) of Theorem 13.3.3.
To show $\phi_n$ is AUMP as well, it is enough to show $n^{1/2}(\hat\theta_n - \theta_0) \to \infty$ in
probability under $\theta_n$ whenever $n^{1/2}(\theta_n - \theta_0) \to \infty$;
the argument is similar to (iv) of Theorem 13.3.3. This last condition holds in
any location model if $\hat\theta_n$ is location equivariant (Problem 13.34). ■

Example 13.3.5 (Correlation Coefficient) Let $X_i = (U_i, V_i)$ be i.i.d. bivariate
normal with zero means, unit variances, and unknown correlation $\rho$. For
testing $\rho = 0$ versus $\rho > 0$, we saw in Example 12.4.4 that Rao's score test rejects
for large values of

$$Z_n = n^{-1/2} \sum_{i=1}^{n} U_i V_i.$$

By Theorem 13.3.3, this test is LAUMP. To show it is also AUMP, we must show
$Z_n \to \infty$ in probability under $\rho_n$ whenever $n^{1/2}\rho_n \to \infty$. Now,

$$E_{\rho_n}(Z_n) = n^{1/2}\rho_n \to \infty$$

and

$$\mathrm{Var}_{\rho_n}(Z_n) = \mathrm{Var}_{\rho_n}(U_1 V_1) \le E_{\rho_n}(U_1^2 V_1^2) = E_{\rho_n}[V_1^2 E_{\rho_n}(U_1^2 | V_1)].$$

But, the conditional distribution of $U_1$ given $V_1$ is $N(\rho_n V_1, 1 - \rho_n^2)$ and so

$$E_{\rho_n}(U_1^2 | V_1) = \rho_n^2 V_1^2 + (1 - \rho_n^2) \le V_1^2 + 1.$$

Hence,

$$\mathrm{Var}_{\rho_n}(Z_n) \le E(V_1^4 + V_1^2) \le 4.$$

The result now follows by Chebyshev's inequality; see Problem 13.31. ■

It is important to recognize that no asymptotic method, efficient or not, can 
perform well in all situations. Some anomalies with the Wald test are discussed 
in Vaeth (1985), Mantel (1987), Le Cam (1990), Benichou, Fears and Gail (1996) 
and Pawitan (2000). We also remark that, for two-sided hypotheses, AUMP tests, 
or even LAUMP tests, typically do not exist (Problem 13.39), but an asymptotic 
approach based on asymptotic unbiasedness is fruitful (Problem 13.55). 

When $\theta = (\theta_1, \ldots, \theta_k)$, it is natural to next consider one-sided tests of $\theta_1$
in the presence of nuisance parameters $\theta_2, \ldots, \theta_k$. One approach to finding an
upper bound for the limiting power of a test sequence is to fix the nuisance 
parameters and apply the results of this section. The resulting bounds need not 
be attainable by any method. A more general approach that leads to bounds 
which are attainable is discussed in Section 13.5. 


13.4 Asymptotically Normal Experiments 

In the previous section, a fairly direct approach was taken to compute the best 
limiting power of a sequence of tests. Since the problem there was reduced to 
testing a simple hypothesis versus a simple alternative, an optimal test could be 
derived via the Neyman-Pearson Lemma for finite sample sizes, which resulted 
in a calculation of the optimal limiting power. Implicit in the calculation was 
the fact that the likelihood ratios behave approximately like those in a normal 
location model. More explicitly, given $n$ i.i.d. observations from a q.m.d. family
$\{P_\theta\}$, when testing $\theta = \theta_0$ versus $\theta = \theta_0 + h n^{-1/2}$, the optimal test rejects for
large values of the likelihood ratio $L_{n,h}$. By Theorem 12.2.3, $L_{n,h}$ satisfies

$$\log(L_{n,h}) - \left[h Z_n - \tfrac{1}{2} h^2 I(\theta_0)\right] = o_{P^n_{\theta_0}}(1), \tag{13.60}$$

where $Z_n$ is the score vector

$$Z_n = 2 n^{-1/2} \sum_{i=1}^{n} \eta(X_i, \theta_0)/p_{\theta_0}^{1/2}(X_i)$$

and $\eta(\cdot, \theta_0)$ is the quadratic mean derivative at $\theta_0$. By contiguity, the left side of
this expression tends to 0 in probability under $P^n_{\theta_0 + h n^{-1/2}}$ as well. The asymptotic
power calculations flow from these results.

An alternative (and more general) approach is based upon a deeper connection
between the expansion (13.60) and the exact likelihood ratios for a particular
normal location model. Specifically, consider the normal location model where
one observes a single observation $X$ from the normal location family $\{Q_h, h \in \mathbb{R}\}$,
where $Q_h$ is the normal distribution with unknown mean $h$ and known variance
$I^{-1}(\theta_0)$. Let $L_h$ denote the likelihood ratio $dQ_h/dQ_0(X)$. Then,

$$\log(L_h) = h Z - \tfrac{1}{2} h^2 I(\theta_0), \tag{13.61}$$

where $Z = I(\theta_0) X$. Hence, the loglikelihood ratio $\log(L_{n,h})$ given by (13.60) behaves
similarly to $\log(L_h)$; the former is approximately quadratic in $h$, it is linear in $Z_n$,
the coefficient of $h^2$ is nonrandom, and $Z_n$ is asymptotically normal $N(0, I(\theta_0))$.
These approximations are exact for the normal experiment with $Z_n$ replaced by
$Z$. In a certain sense, the experiments $\{P^n_{\theta_0 + h n^{-1/2}}, h \in \mathbb{R}\}$ and $\{Q_h, h \in \mathbb{R}\}$
are close to each other. Le Cam (1964) formalized the notion of experiments
being close, and he showed some profound consequences.³ For our purposes, we
would like to show that, corresponding to any test $\phi_n$ based on $X_1, \ldots, X_n$ from
$\{P^n_{\theta_0 + h n^{-1/2}}\}$, there exists a test $\phi$ for the normal location problem such that the
power functions are approximately the same, as functions of the local parameter
$h$. Then, since an optimality result is available for the normal location model (like
a UMP test in the one-sided testing problem), this will directly lead to an upper
bound for what is achievable asymptotically in terms of power for the testing
problem based on $n$ observations from $\{P_\theta\}$.
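
The identity (13.61) follows directly from the form of the normal density. The short derivation below is a worked check, writing $I$ for $I(\theta_0)$ and using only that $Q_h = N(h, I^{-1})$, so that the normalizing constants of the two densities cancel.

```latex
\begin{align*}
\log(L_h)
  &= \log \frac{dQ_h}{dQ_0}(X)
   = -\frac{I}{2}(X - h)^2 + \frac{I}{2}X^2 \\
  &= h\,I X - \frac{1}{2} h^2 I
   = hZ - \frac{1}{2} h^2 I, \qquad Z = I X .
\end{align*}
```

Under $Q_h$, the variable $Z = IX$ has mean $hI$ and variance $I^2 \cdot I^{-1} = I$, which matches the limiting law of $Z_n$ in (13.44).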

Consider the approximating normal experiment consisting of observing one
observation $X$ from $N(h, I^{-1}(\theta_0))$, for which $\theta_0$ is viewed as fixed. If $Z = I(\theta_0)X$,
then $Z$ is an observation from $\tilde Q_h$, where $\tilde Q_h = N(hI(\theta_0), I(\theta_0))$. Clearly, the
information contained in $X$ is the same as that of $Z$. Thus, we could equally
well view the two experiments $\{N(h, I^{-1}(\theta_0)), h \in \mathbb{R}\}$ or $\{N(I(\theta_0)h, I(\theta_0))\}$ as
limiting approximations to the experiment $\{P^n_{\theta_0 + hn^{-1/2}}, h \in \mathbb{R}\}$. The former
representation consisting of observing $X$ from $N(h, I^{-1}(\theta_0))$ seems more natural
since the unknown parameter $h$ refers to the mean of $X$. On the other hand, the
experiment of observing $Z$ from $N(I(\theta_0)h, I(\theta_0))$ directly matches $Z_n$ in (13.60).
The point is that either experiment applies since they are equivalent.
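
As a small numerical sanity check of (13.61) and of the equivalence of the two
representations, the following Python sketch evaluates the log likelihood ratio
directly from the normal densities, once from $X \sim N(h, I^{-1})$ and once from
$Z = IX \sim N(Ih, I)$, and compares both with $hZ - \frac{1}{2}h^2 I$. The function names
and the numerical values of `info`, `h`, and `x` are illustrative assumptions, not
part of the text.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def loglr_from_x(x, h, info):
    """log dQ_h/dQ_0 evaluated at X, where Q_h = N(h, 1/info)."""
    return normal_logpdf(x, h, 1 / info) - normal_logpdf(x, 0.0, 1 / info)

def loglr_from_z(z, h, info):
    """The same log likelihood ratio computed from Z ~ N(info*h, info)."""
    return normal_logpdf(z, info * h, info) - normal_logpdf(z, 0.0, info)

def quadratic_form(z, h, info):
    """Right side of (13.61): h*Z - h^2 * I / 2."""
    return h * z - 0.5 * h ** 2 * info

info, h, x = 2.5, 0.7, -0.3   # arbitrary illustrative values
z = info * x                  # Z = I * X
print(loglr_from_x(x, h, info), loglr_from_z(z, h, info), quadratic_form(z, h, info))
```

All three printed numbers should agree up to floating-point rounding, which is
the sense in which observing $X$ or $Z$ carries the same information about $h$.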

This approach works, not only for one-parameter problems with no nuisance 
parameters, but also for more general testing problems where the hypothesis 
concerns a real-valued parameter in the presence of nuisance parameters, and 
multiparameter problems. For this purpose, we first give the definition of an 
asymptotically normal sequence of experiments. Consider a sequence of statistical 
models $\{Q_{n,h}, h \in \mathbb{R}^k\}$. (This can easily be generalized to the case where $h$ is
only defined for a subset $\Omega_n$ of $\mathbb{R}^k$ which can vary with $n$.) Thus, for a given
$n$, there is available data on the (measure) space $(\mathcal{X}_n, \mathcal{F}_n)$ where the probability
distributions $Q_{n,h}$ live.


3 The term experiment rather than model was used by Le Cam, but the terms are 
essentially synonymous. While a model postulates a family of probability distributions 
from which data can be observed, an experiment additionally specifies the exact amount 
of data (or sample size) that is observed. Thus, if $\{P_\theta, \theta \in \mathbb{R}\}$ is the family of normal
distributions $N(\theta, 1)$ which serves as a model for some data, the experiment $\{P_\theta, \theta \in \mathbb{R}\}$
implicitly means one observation is observed from $N(\theta, 1)$; if an experiment consists of
$n$ observations from $N(\theta, 1)$, then this is denoted $\{P^n_\theta, \theta \in \mathbb{R}\}$.

Definition 13.4.1 For a sequence of experiments $\{Q_{n,h}, h \in \mathbb{R}^k\}$, let $L_{n,h}$
denote the likelihood ratio of $Q_{n,h}$ with respect to $Q_{n,0}$, defined by (12.36).
Suppose there exists a sequence of random $k$-vectors $Z_n$ mapping $\mathcal{X}_n$ to $\mathbb{R}^k$ and
a $k \times k$ positive definite symmetric matrix $C$ such that

    $\log(L_{n,h}) = \langle h, Z_n \rangle - \frac{1}{2} \langle h, Ch \rangle + o_{Q_{n,0}}(1)$    (13.62)

and $Z_n \xrightarrow{d} N(0, C)$ under $Q_{n,0}$. Then, the sequence $\{Q_{n,h}, h \in \mathbb{R}^k\}$ is called
asymptotically normal.

If $\{Q_h\}$ denotes $N(Ch, C)$, the $k$-variate normal distribution with mean vector
$Ch$ and covariance matrix $C$, then we also say that $\{Q_{n,h}, h \in \mathbb{R}^k\}$ converges to
the limiting experiment $\{Q_h\}$. The terminology may be confusing, since $Q_{n,h}$ is
not asymptotically normal (and, in fact, $Q_{n,h}$ typically has a distribution on a
space that varies with $n$); it is the log likelihood ratios from the experiment that
are asymptotically normal. In particular, note that if $L(h)$ denotes the likelihood
of an observation $Z$ from $Q_h$, then

    $\log(L(h)/L(0)) = \langle h, Z \rangle - \frac{1}{2} \langle h, Ch \rangle$ ;

that is, the right side of (13.62) without the error term is exact for the
(multivariate) normal location model.
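
The exactness of the quadratic form in the multivariate normal location model
can be checked numerically. The sketch below does this for $k = 2$ with a
hand-coded bivariate normal log density; the matrix `c`, the local parameter
`h`, and the observation `z` are arbitrary illustrative choices, and the helper
names are assumptions introduced here.

```python
import math

def mvn2_logpdf(z, mean, c):
    """Log density of the bivariate normal N(mean, c) at z (c is 2x2, positive definite)."""
    (a, b), (_, d) = c
    det = a * d - b * b
    u, v = z[0] - mean[0], z[1] - mean[1]
    quad = (d * u * u - 2 * b * u * v + a * v * v) / det  # (z-mean)^T c^{-1} (z-mean)
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def matvec(c, h):
    """Product of a 2x2 matrix with a 2-vector."""
    return [c[0][0] * h[0] + c[0][1] * h[1], c[1][0] * h[0] + c[1][1] * h[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

c = [[2.0, 0.5], [0.5, 1.0]]    # illustrative covariance matrix C
h, z = [0.3, -1.2], [0.7, 0.4]  # illustrative local parameter and observation
exact = mvn2_logpdf(z, matvec(c, h), c) - mvn2_logpdf(z, [0.0, 0.0], c)
quadratic = dot(h, z) - 0.5 * dot(h, matvec(c, h))
print(exact, quadratic)
```

The two printed values should coincide up to rounding, since the log density
ratio of $N(Ch, C)$ to $N(0, C)$ is exactly linear in $z$ plus a constant.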

Example 13.4.1 (Quadratic Mean Differentiable Families) Suppose the
family $\{P_\theta, \theta \in \Omega\}$ is q.m.d. at $\theta_0$. Let $Q_{n,h} = P^n_{\theta_0 + hn^{-1/2}}$ and $C = I(\theta_0)$. By
Theorem 12.2.3, $\{Q_{n,h}\}$ is asymptotically normal with covariance $C$ and $Z_n$ the
score vector as defined in (12.59). Because we are now parametrizing by the local
parameter $h$, we sometimes speak of $\{P^n_{\theta_0 + hn^{-1/2}}\}$ as being locally asymptotically
normal at $\theta_0$, and the terms asymptotically normal and locally asymptotically
normal are used interchangeably. ■
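
To see the expansion (13.62) at work in a concrete q.m.d. family, the following
Python sketch uses i.i.d. exponential observations with rate $\theta$ (density
$\theta e^{-\theta x}$), for which the score at $\theta_0$ is $1/\theta_0 - x$ and $I(\theta_0) = 1/\theta_0^2$. It
computes the exact log likelihood ratio of $\theta_0 + hn^{-1/2}$ against $\theta_0$ and
subtracts the quadratic approximation $hZ_n - \frac{1}{2}h^2 I(\theta_0)$. The choice of family,
the function name `lan_gap`, and the values of `h`, `theta0`, and `n` are
assumptions made only for illustration.

```python
import math
import random

def lan_gap(n, h, theta0, rng):
    """Exact log likelihood ratio minus its quadratic approximation
    for n i.i.d. Exponential(rate=theta0) observations."""
    xs = [rng.expovariate(theta0) for _ in range(n)]
    theta_n = theta0 + h / math.sqrt(n)
    # Exact log likelihood ratio of theta_n versus theta0.
    exact = n * math.log(theta_n / theta0) - (theta_n - theta0) * sum(xs)
    # Normalized score Z_n and Fisher information I(theta0) = 1 / theta0^2.
    z_n = sum(1 / theta0 - x for x in xs) / math.sqrt(n)
    info = 1 / theta0 ** 2
    return exact - (h * z_n - 0.5 * h ** 2 * info)

rng = random.Random(0)
gaps = {n: lan_gap(n, h=1.0, theta0=2.0, rng=rng) for n in (100, 10_000)}
print(gaps)
```

The remainder shrinks as $n$ grows. In this particular family the data cancel
out of the remainder, which equals $n[\log(1+u) - u + u^2/2]$ with
$u = h/(\theta_0 n^{1/2})$ and is therefore of order $n^{-1/2}$.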

The random vector (sequence) $Z_n$ defined by (13.62) is called the score vector.
Note, however, that any $\tilde Z_n$ for which $\tilde Z_n - Z_n \to 0$ in probability under $Q_{n,0}$
also satisfies (13.62).


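Example 13.4.2 (Two-Sample Problems) Suppose that $X_1, \ldots, X_m$ are i.i.d.
according to $P_\theta$, $\theta \in \Omega$, where $\Omega$ is an open subset of $\mathbb{R}^k$. Independently of the
$X$'s, suppose $Y_1, \ldots, Y_n$ are i.i.d. according to $\bar P_\theta$, $\theta \in \Omega$. Suppose both families
are q.m.d. at $\theta_0$. Thus, $\{P^m_{\theta_0 + hm^{-1/2}}, h \in \mathbb{R}^k\}$ and $\{\bar P^n_{\theta_0 + hn^{-1/2}}, h \in \mathbb{R}^k\}$
are each asymptotically normal with corresponding $Z_m$ and $\bar Z_n$ satisfying, as
$m, n \to \infty$, $Z_m \xrightarrow{d} N(0, I(\theta_0))$ and $\bar Z_n \xrightarrow{d} N(0, \bar I(\theta_0))$ under $\theta_0$. Let $L_{m,h}$ be
the likelihood ratio $dP^m_{\theta_0 + hm^{-1/2}} / dP^m_{\theta_0}$ based on $X_1, \ldots, X_m$, and let $\bar L_{n,h}$ be
the corresponding likelihood ratio based on $Y_1, \ldots, Y_n$. Then, for the combined
experiment (and noting $hn^{-1/2} = hm^{-1/2}(m/n)^{1/2}$),

    $\log \frac{d(P^m_{\theta_0 + hn^{-1/2}} \times \bar P^n_{\theta_0 + hn^{-1/2}})}{d(P^m_{\theta_0} \times \bar P^n_{\theta_0})} = \log(L_{m, h(m/n)^{1/2}}) + \log(\bar L_{n,h})$    (13.63)

    $= \langle h(m/n)^{1/2}, Z_m \rangle - \frac{1}{2} \frac{m}{n} \langle h, I(\theta_0)h \rangle + \langle h, \bar Z_n \rangle - \frac{1}{2} \langle h, \bar I(\theta_0)h \rangle + o_{P^m_{\theta_0} \times \bar P^n_{\theta_0}}(1)$

    $= \langle h, (m/n)^{1/2} Z_m + \bar Z_n \rangle - \frac{1}{2} \langle h, (\frac{m}{n} I(\theta_0) + \bar I(\theta_0))h \rangle + o_{P^m_{\theta_0} \times \bar P^n_{\theta_0}}(1)$ .

If we assume $m/n \to \lambda < \infty$, this last expression equals

    $\langle h, \lambda^{1/2} Z_m + \bar Z_n \rangle - \frac{1}{2} \langle h, (\lambda I(\theta_0) + \bar I(\theta_0))h \rangle + o_{P^m_{\theta_0} \times \bar P^n_{\theta_0}}(1)$ .

Thus, the experiment sequence $\{P^m_{\theta_0 + hn^{-1/2}} \times \bar P^n_{\theta_0 + hn^{-1/2}}\}$ is asymptotically
normal with covariance $C = C(\theta_0) = \lambda I(\theta_0) + \bar I(\theta_0)$. ■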

Some properties of an asymptotically normal experiment sequence are the
following. First, $Q_{n,h}$ is contiguous to $Q_{n,0}$, since under $Q_{n,0}$, $\log(L_{n,h}) \xrightarrow{d}
N(-\sigma^2/2, \sigma^2)$, where $\sigma^2 = \langle h, Ch \rangle$, so that Corollary 12.3.1 applies. In fact, the
expansion (13.62) implies that $Q_{n,h_1}$ and $Q_{n,h_2}$ are mutually contiguous for any
$h_1$ and $h_2$ (Problem 13.41). It also follows by Corollary 12.3.2 that, under $Q_{n,h}$,

    $Z_n \xrightarrow{d} N(Ch, C)$ (Problem 13.42).

We are now in a position to relate a testing problem for an asymptotically
normal $\{Q_{n,h}\}$ to one for the normal experiment $\{N(Ch, C)\}$.

Theorem 13.4.1 Suppose $\{Q_{n,h}, h \in \mathbb{R}^k\}$ is an asymptotically normal sequence
of models with covariance matrix $C$. Let $\phi_n$ be a test, i.e., a function defined on
$\mathcal{X}_n$, the space where the probabilities $Q_{n,h}$ live, with values in $[0,1]$. Let $\beta_n(h)$
denote the power of $\phi_n$ against $Q_{n,h}$. Then, for every subsequence $\{n_j\}$, there exists
a further subsequence $\{n_{j_m}\}$ and a test $\phi$ in the limiting experiment $\{N(Ch, C)\}$
(or equivalently, the experiment $\{N(h, C^{-1})\}$) such that, for every $h$,

    $\beta_{n_{j_m}}(h) \to \beta(h)$ ,

where $\beta(h)$ is the power of $\phi$.

Proof. Let $Z_n$ be the vector appearing in Definition 13.4.1, so that under
$Q_{n,0}$, $Z_n \xrightarrow{d} N(0, C)$. Since $\phi_n \in [0,1]$, $\{\phi_n\}$ is tight. Hence, under $Q_{n,0}$, $(\phi_n, Z_n)$
is tight. By Prohorov's Theorem 11.2.15, given any subsequence $\{n_j\}$, there exists
a further subsequence $\{n_{j_m}\}$ such that

    $(\phi_{n_{j_m}}, Z_{n_{j_m}}) \xrightarrow{d} (\bar\phi, Z)$

under $Q_{n_{j_m},0}$, where $Z$ denotes a random variable with distribution $N(0, C)$
(independent of $h$) and $\bar\phi \in [0,1]$. Let $L_{n,h}$ denote the likelihood ratio of $Q_{n,h}$
with respect to $Q_{n,0}$. Then, by (13.62), under $Q_{n_{j_m},0}$,

    $(\phi_{n_{j_m}}, L_{n_{j_m},h}) \xrightarrow{d} \mathcal{L}(\bar\phi, \exp(\langle h, Z \rangle - \frac{1}{2} \langle h, Ch \rangle))$ .

If $F(\cdot, \cdot)$ denotes this limit law, then under $Q_{n_{j_m},h}$, we have by Theorem 12.3.3 that
$(\phi_{n_{j_m}}, L_{n_{j_m},h})$ converges to a limit law with density $r\,dF(t,r)$. But since $\phi_n \in
[0,1]$, weak convergence implies convergence of moments, so that

    $\int \phi_{n_{j_m}} \, dQ_{n_{j_m},h} \to \int\!\!\int t r \, dF(t,r) = E[\bar\phi \exp(\langle h, Z \rangle - \frac{1}{2} \langle h, Ch \rangle)]$ .    (13.64)

Define $\phi(Z) = E(\bar\phi \mid Z)$, i.e., the conditional expectation under the (fixed) joint
distribution of $(\bar\phi, Z)$. Then, the right side of (13.64) is equal to

    $E[\phi(Z) \exp(\langle h, Z \rangle - \frac{1}{2} \langle h, Ch \rangle)] = \int \phi(z) \exp(\langle h, z \rangle - \frac{1}{2} \langle h, Ch \rangle) \, dN(0,C)(z)$ .

But, $\exp(\langle h, z \rangle - \frac{1}{2} \langle h, Ch \rangle) \, dN(0,C)(z)$ is actually the density of $N(Ch, C)$
(Problem 13.43). Hence, if the experiment consists of observing $Z \sim N(Ch, C)$, then
the last expression is

    $E_h[\phi(Z)] = \int \phi(z) \, dN(Ch, C)(z)$ . ■

Theorem 13.4.1 suggests the following strategy for obtaining asymptotically 
optimal tests in a variety of situations. First, an optimal test, say a UMP test, 
is derived (or quoted from an earlier chapter) and its power computed from an 
appropriate normal experiment. Second, the actual experiment sequence is shown 
(or known) to converge to the normal limiting experiment; as a result, the power 
of the normal model serves as an upper bound for the asymptotic power of the 
actual sequence. Finally, a test sequence is constructed whose asymptotic power 
attains the upper bound and which is therefore asymptotically optimal. A similar 
strategy will apply to constructing tests that are asymptotically maximin. The 
remainder of this section will illustrate this approach. 


13.5 Applications to Parametric Models 

13.5.1 One-sided Hypotheses 

We now apply Theorem 13.4.1 to the following situation. Suppose $X_1, \ldots, X_n$
are i.i.d. $P_\theta$, where $\theta$ varies in $\Omega$, an open subset of $\mathbb{R}^k$. Assume the family is
q.m.d. with positive definite Information matrix $I(\theta)$.

First suppose $\theta = (\theta_1, \ldots, \theta_k)$ and consider testing $\theta_1 \le 0$ against $\theta_1 > 0$ in the
presence of nuisance parameters $\theta_2, \ldots, \theta_k$. Fix $\theta_0 = (\theta_{0,1}, \ldots, \theta_{0,k})$ with $\theta_{0,1} = 0$.
We now derive an upper bound for the limiting power of a test $\phi_n$ satisfying, for
$h_1 < 0$,

    $\limsup_{n \to \infty} E_{\theta_0 + hn^{-1/2}}(\phi_n) \le \alpha$ .    (13.65)

By Theorem 13.4.1, we can approximate the power of $\phi_n$ by the power of $\phi =
\phi(X)$, where $X \sim N(h, I^{-1}(\theta_0))$. But then (13.65) implies

    $E_h \phi(X) \le \alpha$ if $h_1 < 0$ ,

i.e., $\phi(X)$ is a level $\alpha$ test for testing $h_1 \le 0$ against $h_1 > 0$ in the limit experiment.
But, by Example 3.9.2, a UMP level $\alpha$ test exists for this problem and has power
$1 - \Phi(z_{1-\alpha} - h_1 \{[I^{-1}(\theta_0)]_{1,1}\}^{-1/2})$. By Theorem 13.4.1, we can conclude that, for $h_1 > 0$,

    $\limsup_n E_{\theta_0 + hn^{-1/2}}(\phi_n) \le 1 - \Phi(z_{1-\alpha} - h_1 \{[I^{-1}(\theta_0)]_{1,1}\}^{-1/2})$ .

More generally, let $g$ be a function from $\Omega$ to $\mathbb{R}$, and assume $g$ is differentiable
with gradient vector $\dot g(\theta)$ of dimension $1 \times k$. The problem is to test $g(\theta) \le 0$
against $g(\theta) > 0$. Suppose $\phi_n$ is a test based on $X_1, \ldots, X_n$ whose limiting size is
at most $\alpha$ (see Definition 11.1.2). Fix $\theta_0$ such that $g(\theta_0) = 0$. We will derive an upper
bound for the limiting power of $\phi_n$ under $\theta_0 + hn^{-1/2}$ and then obtain tests for
which this limiting power is attained. First, note that

    $g(\theta_0 + hn^{-1/2}) = n^{-1/2} \langle \dot g(\theta_0)^T, h \rangle + o(n^{-1/2})$ .

If $h$ is such that $\langle \dot g(\theta_0)^T, h \rangle < 0$, then $g(\theta_0 + hn^{-1/2}) < 0$ for all sufficiently large
$n$. The assumption on the limiting level of $\phi_n$ implies that, for such an $h$,

    $\limsup_{n \to \infty} E_{\theta_0 + hn^{-1/2}}(\phi_n) \le \alpha$ .    (13.66)

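Now, according to Theorem 13.4.1, we can approximate the power of a test
sequence $\phi_n$ by the power of a test $\phi = \phi(X)$ for the experiment based on
$X$ from the model $N(h, I^{-1}(\theta_0))$. Let $\beta_\phi(h)$ denote the power of $\phi(X)$ when
$X \sim N(h, I^{-1}(\theta_0))$. Then, (13.66) implies that $\beta_\phi(h) \le \alpha$ if $\langle \dot g(\theta_0)^T, h \rangle < 0$.
Since $\beta_\phi$ is continuous, it follows that $\beta_\phi(h) \le \alpha$ if $\langle \dot g(\theta_0)^T, h \rangle \le 0$. Now, fix
an alternative $h_1$ with $\langle \dot g(\theta_0)^T, h_1 \rangle > 0$. Theorem 13.4.1 implies that

    $\limsup_{n \to \infty} E_{\theta_0 + h_1 n^{-1/2}}(\phi_n) \le \sup_{\phi \in A_\alpha} \beta_\phi(h_1)$ ,    (13.67)

where $A_\alpha = \{\phi : \beta_\phi(h) \le \alpha$ whenever $\langle \dot g(\theta_0)^T, h \rangle \le 0\}$. But then, the right
side of (13.67) is maximized when $\phi$ is the most powerful level $\alpha$ test for testing
$\langle \dot g(\theta_0)^T, h \rangle \le 0$ against $h = h_1$. In fact, for the problem of testing $\langle \dot g(\theta_0)^T, h \rangle \le 0$
versus $\langle \dot g(\theta_0)^T, h \rangle > 0$, there exists a uniformly most powerful test based on $X$
which rejects for large values of $\langle \dot g(\theta_0)^T, X \rangle$; see Section 3.9.2. But,

    $\langle \dot g(\theta_0)^T, X \rangle \sim N(\langle \dot g(\theta_0)^T, h \rangle, \sigma^2_{\theta_0})$ ,

where

    $\sigma^2_{\theta_0} = \dot g(\theta_0) I^{-1}(\theta_0) \dot g(\theta_0)^T$ .

Hence, for testing $\langle \dot g(\theta_0)^T, h \rangle \le 0$ at level $\alpha$, the UMP test rejects when
$\langle \dot g(\theta_0)^T, X \rangle > z_{1-\alpha} \sigma_{\theta_0}$. The power of this test against $h$ is then

    $1 - \Phi(z_{1-\alpha} - \sigma_{\theta_0}^{-1} \langle \dot g(\theta_0)^T, h \rangle)$ .

Therefore, Theorem 13.4.1 implies that, for any $h$ such that $\langle \dot g(\theta_0)^T, h \rangle > 0$,

    $\limsup_n E_{\theta_0 + hn^{-1/2}}(\phi_n) \le 1 - \Phi(z_{1-\alpha} - \sigma_{\theta_0}^{-1} \langle \dot g(\theta_0)^T, h \rangle)$ .    (13.68)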

The above development is summarized in (i) of the following theorem. Part (ii) 
asserts that an optimal test sequence may be constructed if an efficient estimator 
sequence is available. 
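
The bound (13.68) is straightforward to evaluate once the gradient of $g$ and the
inverse Information matrix at $\theta_0$ are given. The following Python sketch does so
with the standard library's `statistics.NormalDist`; the function name
`local_power_bound` and the numerical inputs are hypothetical and serve only as
an illustration.

```python
from statistics import NormalDist

def local_power_bound(alpha, grad, info_inv, h):
    """Upper bound 1 - Phi(z_{1-alpha} - <grad, h> / sigma) from (13.68),
    where sigma^2 = grad * info_inv * grad^T."""
    k = len(grad)
    sigma2 = sum(grad[i] * info_inv[i][j] * grad[j] for i in range(k) for j in range(k))
    drift = sum(g * hi for g, hi in zip(grad, h))
    z = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z - drift / sigma2 ** 0.5)

# Hypothetical example: g(theta) = theta_1 with a 2x2 inverse Information matrix.
bound = local_power_bound(0.05, grad=[1.0, 0.0],
                          info_inv=[[4.0, 1.0], [1.0, 2.0]], h=[2.0, 0.0])
print(bound)
```

With $h = 0$ the function returns $\alpha$, as it should on the boundary of the null
hypothesis, and the bound increases with $\langle \dot g(\theta_0)^T, h \rangle$.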

Theorem 13.5.1 Suppose $X_1, \ldots, X_n$ are i.i.d. according to $P_\theta$, $\theta \in \Omega$, where
$\Omega$ is assumed to be an open subset of $\mathbb{R}^k$. Let $\Omega_0$ denote the set of $\theta$ with $g(\theta) \le 0$,
for some function $g$ from $\mathbb{R}^k$ to $\mathbb{R}$ which is assumed differentiable with gradient
$\dot g(\theta)$. Consider testing the null hypothesis $\theta \in \Omega_0$ versus $g(\theta) > 0$. Assume the
family $\{P_\theta, \theta \in \Omega\}$ is q.m.d. at every $\theta$ for which $g(\theta) = 0$ with nonsingular
Fisher Information matrix $I(\theta)$.

(i) Let $\phi_n = \phi_n(X_1, \ldots, X_n)$ be a uniformly asymptotically level $\alpha$ sequence of
tests, so that

    $\limsup_{n \to \infty} \sup_{\theta \in \Omega_0} E_\theta(\phi_n) \le \alpha$ ,    (13.69)

and suppose that $g(\theta_0) = 0$. Then, for any $h$ such that $\langle \dot g(\theta_0)^T, h \rangle > 0$, (13.68)
holds.

(ii) Let $\hat\theta_n$ be any estimator satisfying (12.62) (such as an efficient likelihood
estimator). Suppose $I(\theta)$ is continuous in $\theta$ and $\dot g(\theta)$ is continuous at $\theta_0$. Then,
the test sequence $\phi_n$ that rejects when $n^{1/2} g(\hat\theta_n) > z_{1-\alpha} \hat\sigma_n$, where

    $\hat\sigma_n^2 = \dot g(\hat\theta_n) I^{-1}(\hat\theta_n) \dot g(\hat\theta_n)^T$ ,

is pointwise asymptotically level $\alpha$. Moreover, the inequality (13.68) becomes an
equality, and the limsup on the left side of (13.68) may be replaced by a lim.

Proof. The proof of (i) follows from the discussion preceding the theorem
(applying that argument to subsequences for which limits exist). The proof of (ii)
follows from Theorem 12.4.1 and the discussion in Subsection 12.4.2. ■

In fact, the properties claimed in (ii) above hold more generally for any test
sequence that rejects if $T_n > t_n$, if $T_n$ satisfies

    $T_n = \dot g(\theta_0) I^{-1}(\theta_0) Z_{n,\theta_0} + o_{P^n_{\theta_0}}(1)$

for every $\theta_0 \in \Omega_0$, where $Z_{n,\theta_0}$ is the score vector defined in (12.59), and if

    $t_n \xrightarrow{P} z_{1-\alpha} \sigma_{\theta_0}$

under $\theta_0$, where $\sigma_{\theta_0}$ is given by (12.66).


Example 13.5.1 (One-sample Normal Model) Let $X_1, \ldots, X_n$ be i.i.d. normal
with mean $\mu$ and variance $\sigma^2$ so that $\theta = (\mu, \sigma^2)$. Consider testing $\mu \le 0$
versus $\mu > 0$. Of course, the usual $t$-test is UMPU and UMPI. Theorem 13.5.1
applies immediately to the test $\phi_n$ that rejects when $n^{1/2} \bar X_n / S_n$ exceeds $z_{1-\alpha}$.
Therefore, for any $\sigma$,

    $\lim_{n \to \infty} E_{h_1 n^{-1/2}, \sigma}(\phi_n) = 1 - \Phi(z_{1-\alpha} - h_1 \sigma^{-1})$ ,    (13.70)

and so $\phi_n$ is LAUMP. Equation (13.70) also holds for the $t$-test, i.e., when the
normal critical value $z_{1-\alpha}$ is replaced by the corresponding critical value obtained
from the $t$-distribution with $n - 1$ degrees of freedom, which gives an asymptotic
optimality property for the $t$-test that does not depend on the restriction to
unbiased or invariant tests. In fact, we now show $\phi_n$ is AUMP. Specifically, in
the case where there is a nuisance parameter $\sigma$, it is natural to define a test $\phi_n$
to be AUMP level $\alpha$ if $\phi_n$ is uniformly asymptotically level $\alpha$ and for any other
uniformly asymptotically level $\alpha$ test $\psi_n$, we have

    $\limsup_n \sup\{E_{\mu,\sigma}(\psi_n) - E_{\mu,\sigma}(\phi_n) : \mu > 0, \sigma > 0\} \le 0$ .    (13.71)

(Obviously, we would modify this definition if the nuisance parameter $\sigma$ varied
in a parameter space different from the positive reals.) To see that $\phi_n$ possesses
this property, argue as follows. If it did not, there would exist $\mu_n > 0$ and $\sigma_n > 0$
such that

    $\limsup \{E_{\mu_n,\sigma_n}(\psi_n) - E_{\mu_n,\sigma_n}(\phi_n)\} > 0$ .

With $\sigma_n$ now fixed, let $\tilde\psi_n$ be the UMP test for testing $\mu \le 0$ versus $\mu > 0$ if $\sigma = \sigma_n$
is known. Since $\tilde\psi_n$ has greater power than $\psi_n$, it follows that

    $\limsup \{E_{\mu_n,\sigma_n}(\tilde\psi_n) - E_{\mu_n,\sigma_n}(\phi_n)\} > 0$ .

But,

    $E_{\mu_n,\sigma_n}(\tilde\psi_n) = 1 - \Phi(z_{1-\alpha} - n^{1/2} \mu_n / \sigma_n)$ .

Since the power of the $t$-test and the power of the test $\phi_n$ depend on $(\mu, \sigma)$ only
through $\mu/\sigma$,

    $E_{\mu_n,\sigma_n}(\phi_n) = E_{\mu_n/\sigma_n, 1}(\phi_n)$ .

So, it suffices to show, uniformly in $\mu$ and $\sigma = 1$, that

    $\sup_{\mu > 0} |1 - \Phi(z_{1-\alpha} - n^{1/2}\mu) - E_{\mu,1}(\phi_n)| \to 0$ ,

or, for any sequence $\mu_n$ with $\mu_n > 0$,

    $E_{\mu_n,1}(\phi_n) - [1 - \Phi(z_{1-\alpha} - n^{1/2}\mu_n)] \to 0$ .    (13.72)

But,

    $E_{\mu_n,1}(\phi_n) = P_{\mu_n,1}\{n^{1/2}(\bar X_n - \mu_n)/S_n > z_{1-\alpha} - n^{1/2}\mu_n/S_n\}$ .

Under $\mu = \mu_n$ and $\sigma = 1$, the left hand side $n^{1/2}(\bar X_n - \mu_n)/S_n$ has the
$t$-distribution with $n - 1$ degrees of freedom, and so tends in distribution to $Z$
which has the standard normal distribution. Also, $S_n \to 1$ in probability. By
Slutsky's theorem, if $n^{1/2}\mu_n \to \delta$, then

    $E_{\mu_n,1}(\phi_n) \to P\{Z > z_{1-\alpha} - \delta\}$

and (13.72) holds. If $n^{1/2}\mu_n \to \infty$, then $n^{1/2}(\bar X_n - \mu_n)/S_n$ is still
asymptotically standard normal, while $z_{1-\alpha} - n^{1/2}\mu_n/S_n \to -\infty$ in probability; then,
$E_{\mu_n,1}(\phi_n) \to 1$ and (13.72) holds. To complete the argument, one must pass to
subsequences such that $n^{1/2}\mu_n$ converges (possibly to $\infty$) and apply the previous
argument along such subsequences. The conclusion is that $\phi_n$ is AUMP. ■
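
The limiting power $1 - \Phi(z_{1-\alpha} - h_1/\sigma)$ of the studentized test can be compared with
a finite-sample Monte Carlo estimate. The sketch below simulates normal samples
with mean $h_1 n^{-1/2}$ and applies the test that rejects when $n^{1/2}\bar X_n/S_n$ exceeds
the normal quantile. The sample size, number of replications, seed, and
parameter values are arbitrary assumptions, and the simulated rejection rate is
only expected to be close to the limit, not equal to it.

```python
import math
import random
from statistics import NormalDist

def rejects(xs, alpha):
    """Reject when sqrt(n) * Xbar / S exceeds the normal quantile z_{1-alpha}."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return math.sqrt(n) * xbar / s > NormalDist().inv_cdf(1 - alpha)

def simulated_power(n, h1, sigma, alpha, reps, seed=0):
    """Monte Carlo rejection rate when the true mean is h1 / sqrt(n)."""
    rng = random.Random(seed)
    mu = h1 / math.sqrt(n)
    hits = sum(rejects([rng.gauss(mu, sigma) for _ in range(n)], alpha)
               for _ in range(reps))
    return hits / reps

alpha, h1, sigma = 0.05, 2.0, 1.5
# Limiting power 1 - Phi(z_{1-alpha} - h1 / sigma).
limit = 1 - NormalDist().cdf(NormalDist().inv_cdf(1 - alpha) - h1 / sigma)
estimate = simulated_power(n=100, h1=h1, sigma=sigma, alpha=alpha, reps=2000)
print(estimate, limit)
```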


Consider the following special case of Theorem 13.5.1. Suppose $\theta = (\theta_1, \ldots, \theta_k)$
and interest focuses on inference for $\theta_1$ in the presence of the nuisance parameters
$\theta_2, \ldots, \theta_k$. Specifically, consider testing $\theta_1 = \theta_{1,0}$ versus $\theta_1 > \theta_{1,0}$. As usual, let
$I(\theta)$ denote the Fisher Information matrix with $(i,j)$ entry denoted $I_{i,j}(\theta)$; it is
assumed $I(\theta)$ is invertible with inverse $I^{-1}(\theta)$ having $(i,j)$ entry $[I^{-1}(\theta)]_{i,j}$. It is
interesting to compare the power of the asymptotically optimal tests when the
nuisance parameters are unknown with the situation in which they are known. If
$\theta_2, \ldots, \theta_k$ are fixed and known, then the best limiting power against the sequence
of alternatives $\theta_{1,0} + h_1 n^{-1/2}$ of an asymptotically level $\alpha$ test was obtained in
Theorem 13.3.2, and is equal to

    $1 - \Phi(z_{1-\alpha} - h_1 I_{1,1}^{1/2}(\theta_{1,0}, \theta_2, \ldots, \theta_k))$ .

If the nuisance parameters are unknown, the best limiting power was obtained
in Theorem 13.5.1; simply apply the theorem with $g(\theta) = \theta_1$, $\dot g(\theta) = (1, 0, \ldots, 0)$
and $h = (h_1, 0, \ldots, 0)$. The resulting limiting power value is equal to

    $1 - \Phi(z_{1-\alpha} - h_1 \{[I^{-1}(\theta_{1,0}, \theta_2, \ldots, \theta_k)]_{1,1}\}^{-1/2})$ .

Comparing these situations, we see that

    $[I^{-1}(\theta)]_{1,1} \ge 1 / I_{1,1}(\theta)$ ,

since the power of the test when $(\theta_2, \ldots, \theta_k)$ are known exceeds that when
$(\theta_2, \ldots, \theta_k)$ are unknown. Equality holds if $I_{1,j}(\theta) = 0$ for all $j \ne 1$. Since the
same argument applies to any of the components of $\theta$, there is no loss in power
when testing any component in the presence of the remaining parameters if and
only if $I(\theta)$ is a diagonal matrix.
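
For a concrete feel for the loss in power caused by an unknown nuisance
parameter, the following Python sketch compares the two limiting powers for a
$2 \times 2$ Information matrix. For such a matrix, $1/[I^{-1}]_{1,1} = I_{1,1} - I_{1,2}^2 / I_{2,2}$.
The matrices, `h1`, and the function name are hypothetical values chosen for
illustration.

```python
from statistics import NormalDist

def limiting_powers(info, h1, alpha):
    """Limiting power for theta_1 with the nuisance parameter known vs. unknown,
    for a symmetric 2x2 Fisher Information matrix `info`."""
    (a, b), (_, d) = info
    known = a                  # effective information I_{1,1}
    unknown = a - b * b / d    # 1 / [I^{-1}]_{1,1} for a 2x2 matrix
    z = NormalDist().inv_cdf(1 - alpha)

    def power(eff):
        return 1 - NormalDist().cdf(z - h1 * eff ** 0.5)

    return power(known), power(unknown)

print(limiting_powers([[2.0, 0.8], [0.8, 1.0]], h1=1.5, alpha=0.05))  # correlated scores
print(limiting_powers([[2.0, 0.0], [0.0, 1.0]], h1=1.5, alpha=0.05))  # diagonal Information
```

In the first call the power with the nuisance parameter known is strictly
larger; in the second, with a diagonal matrix, the two powers coincide.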

Example 13.5.2 (Location Scale Models) Suppose $X_1, \ldots, X_n$ are i.i.d. with
density $\sigma^{-1} f((x - \mu)/\sigma)$, where $f$ is absolutely continuous. Both the location
parameter $\mu$ and the scale parameter $\sigma$ are unknown. If $\theta = (\mu, \sigma)$, then the
components of the Information matrix are given by (Problem 13.44)

    $I_{1,1} = \sigma^{-2} \int [f'(x)/f(x)]^2 f(x) \, dx$ ,

    $I_{2,2} = \sigma^{-2} \int [x f'(x)/f(x) + 1]^2 f(x) \, dx$ ,

and

    $I_{1,2} = \sigma^{-2} \int x [f'(x)/f(x)]^2 f(x) \, dx$ .

It follows that the off-diagonal element $I_{1,2}$ is equal to 0 if $f$ is symmetric.

We specialize further and let $f(x) = C(\beta) \exp(-|x|^\beta)$ for some fixed $\beta$. Recall
from Example 12.2.5 that, if $\beta > 1/2$, then $f$ generates a location model which is
q.m.d.; the location scale model with $\sigma$ unknown is also q.m.d. (Problem 13.45).
For $\beta > 1$, the MLE $\hat\mu_n$ for $\mu$ is the unique minimizer of $\sum_i |X_i - \mu|^\beta$; for $\beta = 1$,
any value between the middle order statistics is an MLE. Moreover, the unique
MLE $\hat\sigma_n$ for $\sigma$ is given by

    $\hat\sigma_n = \beta^{1/\beta} ( \frac{1}{n} \sum_i |X_i - \hat\mu_n|^\beta )^{1/\beta}$ .    (13.73)

For testing $\mu \le 0$ against $\mu > 0$, the Wald test which rejects for large values of
$\hat\mu_n / \hat\sigma_n$ is LAUMP. If $1/2 < \beta < 1$, Rao's score test is more convenient to apply
and is LAUMP (Problem 13.46). ■

Example 13.5.3 (Bivariate Normal Correlation) As in Example 13.3.5, let
$X_i = (U_i, V_i)$ be i.i.d. bivariate normal with unknown correlation $\rho$. However,
here we assume the means and variances of $U_i$ and $V_i$ are unknown as well. The
MLE $\hat\rho_n$ is given by the sample correlation (11.29). A LAUMP test rejects when
$n^{1/2} \hat\rho_n > z_{1-\alpha}$. Note that, in this case, the Information is not diagonal and the
optimal limiting power is strictly smaller than the case where only $\rho$ is unknown
(Problem 13.47). ■

Theorem 13.5.1 can be generalized to two-sample problems, since the proof
essentially only depends on Theorem 13.4.1 and the assumption that the experiment
is asymptotically normal. By Example 13.4.2, asymptotic normality holds
for two-sample models if each of the one-sample models is quadratic mean
differentiable. Specifically, suppose $X_1, \ldots, X_m$ are i.i.d. $P_\theta$, $\theta \in \Omega$, $\Omega$ an open subset of
$\mathbb{R}^k$. Also, suppose $Y_1, \ldots, Y_n$ are i.i.d. $\bar P_\theta$, $\theta \in \Omega$. Let $I(\theta)$ denote the Information
an $X_i$ contains about $\theta$; similarly, let $\bar I(\theta)$ be the Information a $Y_j$ contains about
$\theta$. Assume these Information matrices are nonsingular and continuous. Fix any
$\theta_0$ and assume both models are q.m.d. at $\theta_0$ with corresponding score statistics
$Z_m$ and $\bar Z_n$ (in the notation of Example 13.4.2). Then, the combined experiment
is asymptotically normal with score statistic

    $Z_{m,n} = (m/n)^{1/2} Z_m + \bar Z_n$ .

If we also assume $m/n \to \lambda < \infty$, then the joint experiment is asymptotically
normal with covariance

    $C(\theta_0) = \lambda I(\theta_0) + \bar I(\theta_0)$ .

Consider testing $g(\theta) = 0$ versus $g(\theta) > 0$, for some continuously differentiable
$g$ with gradient $\dot g(\theta)$. A generalization of (13.68) yields for any uniformly
asymptotically level $\alpha$ test sequence that (Problem 13.48)

    $\limsup_n E_{\theta_0 + hn^{-1/2}}(\phi_n) \le 1 - \Phi(z_{1-\alpha} - \sigma_{\theta_0}^{-1} \langle \dot g(\theta_0)^T, h \rangle)$ ,    (13.74)

where

    $\sigma^2_{\theta_0} = \dot g(\theta_0) C^{-1}(\theta_0) \dot g(\theta_0)^T$ .

To find such a test, assume there exists an estimator sequence $\hat\theta_n$ satisfying

    $n^{1/2}(\hat\theta_n - \theta_0) = C^{-1}(\theta_0) Z_{m,n} + o_{P^m_{\theta_0} \times \bar P^n_{\theta_0}}(1)$ .    (13.75)

Then, the test that rejects when $n^{1/2} g(\hat\theta_n) > z_{1-\alpha} \hat\sigma_n$, where

    $\hat\sigma_n^2 = \dot g(\hat\theta_n) C^{-1}(\hat\theta_n) \dot g(\hat\theta_n)^T$

is pointwise asymptotically level $\alpha$ and the inequality (13.74) is an equality
(Problem 13.49).


Example 13.5.4 (Behrens-Fisher Problem) As a special case of the above,
assume $P_\theta$ is $N(\xi, \sigma^2)$ and $\bar P_\theta$ is $N(\eta, \tau^2)$ so that $\theta = (\xi, \eta, \sigma^2, \tau^2)$, and all four
parameters vary freely. Consider testing $\eta - \xi = 0$ versus $\eta - \xi > 0$, so that
$g(\theta) = \eta - \xi$ and $\dot g(\theta) = (-1, 1, 0, 0)$. For this problem, neither invariance nor
unbiasedness considerations reduce the problem sufficiently to obtain any kind
of optimal test. However, a large sample optimality result is easily obtained. Fix
$\theta_0 = (\xi_0, \xi_0, \sigma^2, \tau^2)$. Assume $m/n \to \lambda < \infty$. Then, it is easy to check that the
covariance matrix $C$ in Definition 13.4.1 is the diagonal matrix with diagonal
elements $\lambda/\sigma^2$, $1/\tau^2$, $\lambda/(2\sigma^4)$, and $1/(2\tau^4)$. Hence,

    $\sigma^2_{\theta_0} = \dot g(\theta_0) C^{-1}(\theta_0) \dot g(\theta_0)^T = \frac{\sigma^2}{\lambda} + \tau^2$ .    (13.76)

Thus, the bound in (13.74) with $h = (h_1, h_2, 0, 0)$ reduces to

    $1 - \Phi( z_{1-\alpha} - (\frac{\sigma^2}{\lambda} + \tau^2)^{-1/2} (h_2 - h_1) )$ .

It is easy to construct a test sequence that achieves this bound. Consider the test
that rejects the null hypothesis when

    $n^{1/2}(\bar Y_n - \bar X_m) > z_{1-\alpha} [S_Y^2 + (\frac{n}{m}) S_X^2]^{1/2}$ ,

where $\bar Y_n = n^{-1} \sum_j Y_j$, $S_Y^2 = (n-1)^{-1} \sum_j (Y_j - \bar Y_n)^2$, and similarly for $\bar X_m$ and
$S_X^2$. This test is pointwise consistent in level; the order of error in the rejection
probability will be revisited in Example 15.6.3. The limiting power of this test
against the sequence of parameter values $(\xi_0 + h_1 n^{-1/2}, \xi_0 + h_2 n^{-1/2}, \sigma^2, \tau^2)$ is
given by the limit of

    $P\{ \frac{n^{1/2}[(\bar Y_n - h_2 n^{-1/2}) - (\bar X_m - h_1 n^{-1/2})]}{[S_Y^2 + \frac{n}{m} S_X^2]^{1/2}} > z_{1-\alpha} - \frac{h_2 - h_1}{[S_Y^2 + \frac{n}{m} S_X^2]^{1/2}} \}$ .

But, $S_Y^2 \to \tau^2$ in probability, $S_X^2 \to \sigma^2$ in probability, and the left hand side is
asymptotically standard normal. The result follows by Slutsky's theorem. ■
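
The test statistic and the power bound for the Behrens-Fisher problem translate
directly into code. The Python sketch below implements the rejection rule
$n^{1/2}(\bar Y_n - \bar X_m) > z_{1-\alpha}[S_Y^2 + (n/m)S_X^2]^{1/2}$ and the bound obtained from (13.76);
the function names and the small data sets in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

def behrens_fisher_rejects(xs, ys, alpha):
    """One-sided large-sample test of eta - xi = 0 versus eta - xi > 0: reject when
    sqrt(n) * (Ybar - Xbar) > z_{1-alpha} * sqrt(S_Y^2 + (n/m) * S_X^2)."""
    m, n = len(xs), len(ys)
    xbar, ybar = sum(xs) / m, sum(ys) / n
    s2x = sum((x - xbar) ** 2 for x in xs) / (m - 1)
    s2y = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    z = NormalDist().inv_cdf(1 - alpha)
    return math.sqrt(n) * (ybar - xbar) > z * math.sqrt(s2y + (n / m) * s2x)

def power_bound(h1, h2, sigma2, tau2, lam, alpha):
    """Bound 1 - Phi(z_{1-alpha} - (h2 - h1) / sqrt(sigma^2 / lambda + tau^2))."""
    z = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z - (h2 - h1) / math.sqrt(sigma2 / lam + tau2))

xs = [0.0, 0.1, -0.1, 0.05]   # made-up X sample
ys = [5.0, 5.1, 4.9, 5.05]    # made-up Y sample with a clearly larger mean
print(behrens_fisher_rejects(xs, ys, alpha=0.05))
print(power_bound(h1=0.0, h2=2.0, sigma2=1.0, tau2=1.0, lam=1.0, alpha=0.05))
```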


13.5.2 Equivalence Hypotheses 

In this section, we will apply Theorem 13.4.1 to the following situation. Suppose
$X_1, \ldots, X_n$ are i.i.d. $P_\theta$ where $\theta \in \Omega$ and $\Omega$ is an open subset of $\mathbb{R}^k$. Interest
focuses on $g(\theta)$, where $g$ is a function from $\Omega$ to $\mathbb{R}$. Assume $g$ is differentiable
with gradient vector $\dot g(\theta)$ of dimension $1 \times k$. We wish to test the null
hypothesis $|g(\theta)| \ge \Delta$ against the alternative $|g(\theta)| < \Delta$. (We are tacitly assuming
there exist values of $\theta$ satisfying $g(\theta) > \Delta$ and $g(\theta) < \Delta$.) This problem was
studied in Theorem 3.7.1, where a UMP test was derived for a one-parameter
exponential family. A UMP equivalence test for a linear combination of means of
a multivariate normal distribution was obtained in Example 3.9.3.

We will formulate the asymptotic problem in two distinct ways. First, we will
consider the case when the null hypothesis parameter space is the complement of
a fixed interval $(-\Delta, \Delta)$. Then, we will also consider the case when this interval
shrinks with $n$.

(i). Fixed $\Delta$. Suppose $\Delta > 0$ is fixed and the problem is to test $|g(\theta)| \ge \Delta$
versus $|g(\theta)| < \Delta$. For any fixed alternative value $\theta$ with $|g(\theta)| < \Delta$, the power of
any reasonable test against $\theta$ will tend to one. Therefore, just as we did for
one-sided hypotheses, we compare power functions against local alternatives. Consider
any fixed $\theta_0$ satisfying $|g(\theta_0)| = \Delta$. For sake of argument, consider the case
$g(\theta_0) = -\Delta$. We wish to derive an (obtainable) upper bound for the limiting
power of a test sequence $\phi_n$ under $\theta_0 + hn^{-1/2}$. But a crude way to bound the
power is based on the simple fact that any level $\alpha$ test for testing $|g(\theta)| \ge \Delta$
versus $|g(\theta)| < \Delta$ is also level $\alpha$ for testing $g(\theta) \le -\Delta$ versus $g(\theta) > -\Delta$.
Since upper bounds for the asymptotic power were obtained in Theorem 13.5.1,
an immediate result follows. In this asymptotic setup, the statistical problem
is somewhat degenerate as it becomes one of testing a one-sided hypothesis.
For example, suppose $X_1, \ldots, X_n$ are i.i.d. $N(\theta, 1)$. Then for large $n$, one can
distinguish $\theta \le -\Delta$ and $\theta \ge \Delta$ with error probabilities that are uniformly
small and tend to zero exponentially fast with $n$. In essence, the statistical issue
arises only if the true $\theta$ is near the boundary of $[-\Delta, \Delta]$, in which case determining
significance essentially becomes one of testing a one-sided hypothesis.

Theorem 13.5.2 Suppose $X_1, \ldots, X_n$ are i.i.d. according to $P_\theta$, $\theta \in \Omega$, where
$\Omega$ is assumed to be an open subset of $\mathbb{R}^k$. Consider testing the null hypothesis

    $\theta \in \Omega_0 = \{\theta : |g(\theta)| \ge \Delta\}$

versus $|g(\theta)| < \Delta$, where the function $g$ from $\mathbb{R}^k$ to $\mathbb{R}$ is assumed differentiable
with gradient $\dot g(\theta)$. Assume the family $\{P_\theta, \theta \in \Omega\}$ is q.m.d. at every $\theta$ with
$|g(\theta)| = \Delta$ and assume the Fisher Information matrix $I(\theta)$ is nonsingular for
such $\theta$. Let $\phi_n = \phi_n(X_1, \ldots, X_n)$ be a uniformly asymptotically level $\alpha$ sequence
of tests; that is,

    $\limsup_{n \to \infty} \sup_{\theta \in \Omega_0} E_\theta(\phi_n) \le \alpha$ .    (13.77)

(i) Assume $\theta_0$ satisfies $g(\theta_0) = -\Delta$. Then, for any $h$ such that $\langle \dot g(\theta_0)^T, h \rangle > 0$,

    $\limsup_n E_{\theta_0 + hn^{-1/2}}(\phi_n) \le 1 - \Phi(z_{1-\alpha} - \sigma_{\theta_0}^{-1} \langle \dot g(\theta_0)^T, h \rangle)$ ,    (13.78)

where

    $\sigma^2_{\theta_0} = \dot g(\theta_0) I^{-1}(\theta_0) \dot g(\theta_0)^T$ .    (13.79)

(ii) Assume $\theta_0$ satisfies $g(\theta_0) = \Delta$. Then, for any $h$ such that $\langle \dot g(\theta_0)^T, h \rangle < 0$,

    $\limsup_n E_{\theta_0 + hn^{-1/2}}(\phi_n) \le 1 - \Phi(z_{1-\alpha} - \sigma_{\theta_0}^{-1} |\langle \dot g(\theta_0)^T, h \rangle|)$ .    (13.80)

(iii) Let $\hat\theta_n$ be any estimator satisfying (12.62). Suppose $I(\theta)$ is continuous in
$\theta$ and $\dot g(\theta)$ is continuous at $\theta_0$. Then, the test sequence $\phi_n$ that rejects when
$|g(\hat\theta_n)| < \Delta - n^{-1/2} \hat\sigma_n z_{1-\alpha}$, where

    $\hat\sigma_n^2 = \dot g(\hat\theta_n) I^{-1}(\hat\theta_n) \dot g(\hat\theta_n)^T$    (13.81)

is pointwise asymptotically level $\alpha$ and is locally asymptotically UMP in the sense
that the inequality (13.78) is an equality. In fact, the same properties hold for any
test sequence that rejects if $|T_n| \le \Delta - n^{-1/2} \hat\sigma_n z_{1-\alpha}$, if $T_n$ satisfies

    $T_n = \dot g(\theta_0) I^{-1}(\theta_0) Z_{n,\theta_0} + o_{P^n_{\theta_0}}(1)$    (13.82)

for every $\theta_0 \in \Omega_0$, where $Z_{n,\theta_0}$ is the score vector defined in (12.59).

Proof. As remarked above, (13.78) follows because $\phi_n$ is also a uniformly
asymptotically level $\alpha$ test for testing $g(\theta) \le -\Delta$ versus $g(\theta) > -\Delta$. For this one-sided
testing problem, the optimal bound was obtained in Theorem 13.5.1. The same
argument applies to (13.80). To prove (iii), let $\theta_n = \theta_0 + hn^{-1/2}$. Then, assumption
(12.62) and contiguity arguments imply that, under $\theta_n$,

    $n^{1/2}(\hat\theta_n - \theta_n) \xrightarrow{d} N(0, I^{-1}(\theta_0))$ .

Thus, under $\theta_n$, $\hat\sigma_n$ tends in probability to $\sigma_{\theta_0}$. Moreover, the Delta method
implies, under $\theta_n$,

    $n^{1/2}(g(\hat\theta_n) - g(\theta_n)) \xrightarrow{d} N(0, \sigma^2_{\theta_0})$ .

Now, if $g(\theta_0) = -\Delta$ and $\langle \dot g(\theta_0)^T, h \rangle > 0$, then

    $g(\theta_n) = -\Delta + n^{-1/2} \langle \dot g(\theta_0)^T, h \rangle + o(n^{-1/2})$ .

So, under $\theta_n$,

    $n^{1/2}[g(\hat\theta_n) + \Delta] \xrightarrow{d} N(\langle \dot g(\theta_0)^T, h \rangle, \sigma^2_{\theta_0})$ .    (13.83)

Therefore,

    $E_{\theta_n}(\phi_n) = P_{\theta_n}\{|g(\hat\theta_n)| < \Delta - n^{-1/2} \hat\sigma_n z_{1-\alpha}\}$ ,

which tends to the right side of (13.78) by (13.83) and Slutsky's Theorem. The
same proof works for any estimator $T_n$ of the form (13.82). ■
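
As an illustration of the construction in part (iii) of Theorem 13.5.2, the
following Python sketch specializes it to a normal mean, taking $g(\theta) = \mu$,
$\hat\theta_n$ equal to the sample mean, and $\hat\sigma_n$ equal to the sample standard deviation,
so that equivalence is declared when $|\bar X_n| < \Delta - n^{-1/2} S_n z_{1-\alpha}$. This
specialization, the function names, and the data are assumptions made for the
example; the second function evaluates the right side of (13.78) with
$\langle \dot g(\theta_0)^T, h \rangle = h$.

```python
import math
from statistics import NormalDist

def equivalence_test_rejects(xs, delta, alpha):
    """Test of |mu| >= delta versus |mu| < delta: reject (declare equivalence)
    when |Xbar| < delta - S * z_{1-alpha} / sqrt(n)."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return abs(xbar) < delta - s * NormalDist().inv_cdf(1 - alpha) / math.sqrt(n)

def local_power(h, sigma, alpha):
    """Limiting power 1 - Phi(z_{1-alpha} - h / sigma) against mean -delta + h / sqrt(n)."""
    return 1 - NormalDist().cdf(NormalDist().inv_cdf(1 - alpha) - h / sigma)

data = [0.1, -0.2, 0.05, 0.0, -0.05, 0.15, -0.1, 0.05]    # made-up sample near zero
print(equivalence_test_rejects(data, delta=1.0, alpha=0.05))   # wide equivalence margin
print(equivalence_test_rejects(data, delta=0.01, alpha=0.05))  # margin too narrow to reject
print(local_power(h=2.0, sigma=1.0, alpha=0.05))
```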

Example 13.5.5 (Normal One-Sample Problem) Suppose $X_1, \ldots, X_n$ are
i.i.d. $N(\mu, \sigma^2)$, with both parameters unknown. Consider testing $|\mu| \ge \Delta$ versus
$|\mu| < \Delta$. The standard $t$-test for testing the one-sided hypothesis $\mu \le -\Delta$ against
$\mu > -\Delta$ rejects if

    $n^{1/2}(\bar X_n + \Delta)/S_n > t_{n-1,1-\alpha}$ ,

where $S_n^2$ is the (unbiased) sample variance and $t_{n-1,1-\alpha}$ is the $1 - \alpha$ quantile of
the $t$-distribution with $n - 1$ degrees of freedom. Similarly, the standard $t$-test of
the hypothesis $\mu \ge \Delta$ rejects if

    $n^{1/2}(\bar X_n - \Delta)/S_n < -t_{n-1,1-\alpha}$ .

The intersection of these rejection regions is therefore a level $\alpha$ test of the
null hypothesis $|\mu| \ge \Delta$. Such a construction that intersects the rejection
regions of two one-sided tests (TOST) was proposed in Westlake (1981) and
Schuirmann (1981), and can be viewed as a special case of Berger's (1982)
intersection-union tests. The resulting test, denoted $\phi_n^{TOST}$, rejects when
$|\bar X_n| < \Delta - n^{-1/2} S_n t_{n-1,1-\alpha}$. (In fact, we see here that our general asymptotic
construction in (iii) of the above theorem merely replaces the $t_{n-1}$ quantiles by
the standard normal quantiles; that is, the intersection of two rejection regions, each
of asymptotic size $\alpha$, yields a rejection region whose asymptotic size is bounded
above by $\alpha$.) In general, by combining two one-sided tests, the resulting TOST
can be quite conservative in that its size can be much less than $\alpha$. However, in
this example, the size of $\phi_n^{TOST}$ is actually $\alpha$, as can be seen by calculating the
rejection probability under $(\mu, \sigma)$ with $\mu = \Delta$ and $\sigma \to 0$ (Problem 13.53). The
asymptotic power of $\phi_n^{TOST}$ against a sequence with mean $-\Delta + hn^{-1/2}$ ($h > 0$)
and variance fixed at $\sigma^2$ is obtained by the previous theorem or calculated directly
as

    $\lim_{n \to \infty} P_{-\Delta + hn^{-1/2}, \sigma}\{|\bar X_n| < \Delta - n^{-1/2} S_n t_{n-1,1-\alpha}\} = 1 - \Phi(z_{1-\alpha} - \frac{h}{\sigma})$ ,

which is the optimal bound when (13.78) is specialized to this situation. A similar
calculation applies to sequences of the form $\Delta - hn^{-1/2}$. Thus, the TOST is
asymptotically optimal in this setup. It should be remarked that the TOST has
been criticized because it is biased (in finite samples) and tests have been
proposed that have greater power; some proposals are discussed in Brown, Casella,
and Hwang (1995), Berger and Hsu (1996), and Perlman and Wu (1999). Such
tests cannot have greater asymptotic power against local alternatives, at least
under the setup of Theorem 13.5.2. On the other hand, the TOST will be seen
to be inefficient under the asymptotic formulation treated below. ■

(ii) Shrinking $\Delta$. We now consider a second asymptotic formulation of the
problem, in which the null hypothesis $|g(\theta)| \ge \delta n^{-1/2}$ is tested against the alternative
hypothesis $|g(\theta)| < \delta n^{-1/2}$. Notice that now the parameter spaces (or hypotheses)
are changing with $n$. Of course, a given hypothesis testing situation deals
with a particular $n$, and there is flexibility in how the problem is embedded into
a sequence of similar problems to get a useful approximation. In particular, if
equivalence corresponds to $|g(\theta)| < \Delta$, we can always make the identification
$\delta = \Delta n^{1/2}$. From an asymptotic point of view, it makes sense to allow the null
hypothesis parameter space to change with $n$, since otherwise the problem
becomes degenerate in the sense that the values of $\Delta$ and $-\Delta$ for $g(\theta)$ can be
perfectly distinguished asymptotically. In testing for bioequivalence, for example,
$\Delta$ is chosen so small that a value of $|g(\theta)| < \Delta$ is deemed to be essentially
zero. In a particular situation such as Example 13.5.5 with $\sigma$ not too small, if a
value for $\mu$ of $\Delta$ cannot be perfectly tested against a value for $\mu$ of 0, then $\Delta$ and
$-\Delta$ cannot be perfectly tested as well, and the asymptotic setup should reflect
this.

The main result of this subsection is the following theorem. 


Theorem 13.5.3 Suppose $X_1,\ldots,X_n$ are i.i.d. according to $P_\theta$, $\theta \in \Omega$, where
$\Omega$ is assumed to be an open subset of $\mathbb{R}^k$. Consider testing the null hypothesis

$$\theta \in \Omega_{0,n} = \{\theta : |g(\theta)| \ge \delta n^{-1/2}\}$$

versus $|g(\theta)| < \delta n^{-1/2}$, where the function $g$ from $\mathbb{R}^k$ to $\mathbb{R}$ is assumed differentiable
with gradient $\dot g(\theta)$. Assume for every $\theta$ with $g(\theta) = 0$ that the family
$\{P_\theta, \theta \in \Omega\}$ is q.m.d. at $\theta$ and $I(\theta)$ is nonsingular.

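(i) Let $\phi_n = \phi_n(X_1,\ldots,X_n)$ be a uniformly asymptotically level $\alpha$ sequence of
tests, so that

$$\limsup_{n\to\infty}\ \sup_{\theta \in \Omega_{0,n}} E_\theta(\phi_n) \le \alpha .$$

Assume $\theta_0$ satisfies $g(\theta_0) = 0$. Then, for any $h$ such that $|\langle \dot g(\theta_0)^T, h\rangle| = \delta' < \delta$,

$$\limsup_{n\to\infty} E_{\theta_0 + hn^{-1/2}}(\phi_n) \le \Phi\left(\frac{C - \delta'}{\sigma_{\theta_0}}\right) - \Phi\left(\frac{-C - \delta'}{\sigma_{\theta_0}}\right) , \tag{13.84}$$

where $\sigma_{\theta_0}$ is given by

$$\sigma^2_{\theta_0} = \dot g(\theta_0) I^{-1}(\theta_0) \dot g(\theta_0)^T \tag{13.85}$$

and $C = C(\alpha, \delta, \sigma_{\theta_0})$ satisfies

$$\Phi\left(\frac{C - \delta}{\sigma_{\theta_0}}\right) - \Phi\left(\frac{-C - \delta}{\sigma_{\theta_0}}\right) = \alpha . \tag{13.86}$$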


(ii) Let $\hat\theta_n$ be any estimator satisfying (12.62). Suppose $I(\theta)$ is continuous in
$\theta$ and $\dot g(\theta)$ is continuous at $\theta_0$. Then, the test sequence $\phi_n^*$ that rejects when
$n^{1/2}|g(\hat\theta_n)| \le C(\alpha, \delta, \hat\sigma_n)$, where

$$\hat\sigma_n^2 = \dot g(\hat\theta_n) I^{-1}(\hat\theta_n) \dot g(\hat\theta_n)^T ,$$

is pointwise asymptotically level $\alpha$ and is locally asymptotically UMP in the sense
that the inequality (13.84) is an equality. In fact, the same properties hold for any
test sequence that rejects if $|T_n| \le C(\alpha, \delta, \hat\sigma_n)$, if $T_n$ satisfies

$$T_n = \dot g(\theta_0) I^{-1}(\theta_0) Z_{n,\theta_0} + o_{P^n_{\theta_0}}(1)$$

for every $\theta_0 \in \Omega_0$, where $Z_{n,\theta_0}$ is the score vector defined in (12.59).


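Proof. Fix $\theta_0$ satisfying $g(\theta_0) = 0$. We will derive an upper bound for the
limiting power of a test sequence $\phi_n$ under $\theta_0 + hn^{-1/2}$. Note that

$$g(\theta_0 + hn^{-1/2}) = n^{-1/2}\langle \dot g(\theta_0)^T, h\rangle + o(n^{-1/2}) .$$

So, if $h$ is such that $|\langle \dot g(\theta_0)^T, h\rangle| > \delta$, then $|g(\theta_0 + hn^{-1/2})| > \delta n^{-1/2}$ for all
sufficiently large $n$. Hence, if $\phi_n$ has limiting size $\alpha$, then for such an $h$,

$$\limsup_{n\to\infty} E_{\theta_0 + hn^{-1/2}}(\phi_n) \le \alpha . \tag{13.87}$$

By Theorem 13.4.1, we can approximate the power of a test sequence $\phi_n$ by
the power of a test $\phi = \phi(X)$ for the (limit) experiment based on $X$ from the
model $N(h, I^{-1}(\theta_0))$. Let $\beta_\phi(h)$ denote the power function of $\phi(X)$ when
$X \sim N(h, I^{-1}(\theta_0))$. Then, (13.87) implies $\beta_\phi(h) \le \alpha$ if $|\langle \dot g(\theta_0)^T, h\rangle| > \delta$. By continuity
of $\beta_\phi(h)$, $\beta_\phi(h) \le \alpha$ for any $h$ with $|\langle \dot g(\theta_0)^T, h\rangle| \ge \delta$. The test $\phi$ that maximizes
$\beta_\phi(h)$ for this limiting normal problem was given in Example 3.9.3 with
$\Sigma = I^{-1}(\theta_0)$, $\xi = h$, and $a^T = \dot g(\theta_0)$. Thus, if $\phi$ is level $\alpha$ for testing
$|\langle \dot g(\theta_0)^T, h\rangle| \ge \delta$ and $h$ satisfies $|\langle \dot g(\theta_0)^T, h\rangle| = \delta' < \delta$, then

$$\beta_\phi(h) \le \Phi\left(\frac{C - \delta'}{\sigma_{\theta_0}}\right) - \Phi\left(\frac{-C - \delta'}{\sigma_{\theta_0}}\right) ,$$

and $C = C(\alpha, \delta, \sigma_{\theta_0})$ satisfies (13.86).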

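To prove (ii), consider the test that rejects when $n^{1/2}|g(\hat\theta_n)| \le C(\alpha, \delta, \hat\sigma_n)$. Fix
$h$ such that $|\langle \dot g(\theta_0)^T, h\rangle| < \delta$ and let $\theta_n = \theta_0 + hn^{-1/2}$. Then, as in the proof of
Theorem 13.5.2 (iii), under $\theta_n$,

$$n^{1/2}[g(\hat\theta_n) - g(\theta_n)] \xrightarrow{d} N(0, \sigma^2_{\theta_0}) .$$

But,

$$n^{1/2} g(\theta_n) = \langle h, \dot g(\theta_0)^T\rangle + o(1) .$$

Therefore, under $\theta_n$,

$$n^{1/2} g(\hat\theta_n) \xrightarrow{d} N\left(\langle h, \dot g(\theta_0)^T\rangle, \sigma^2_{\theta_0}\right) .$$

Also, under $\theta_n$, $\hat\sigma_n$ tends in probability to $\sigma_{\theta_0}$, and so $C(\alpha, \delta, \hat\sigma_n)$ tends in
probability to $C(\alpha, \delta, \sigma_{\theta_0})$. Hence, letting $Z$ denote a standard normal variable,

$$P_{\theta_n}\{n^{1/2}|g(\hat\theta_n)| \le C(\alpha, \delta, \hat\sigma_n)\} \to P\{|\sigma_{\theta_0} Z + \langle h, \dot g(\theta_0)^T\rangle| \le C(\alpha, \delta, \sigma_{\theta_0})\} ,$$

which agrees with the right hand side of (13.84). ■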
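
As a numerical aside, the constant $C(\alpha, \delta, \sigma_{\theta_0})$ in (13.86) has no closed form, but the left side of (13.86) is increasing in $C$, so a simple root finder suffices. The following Python sketch solves (13.86) by bisection and evaluates the right side of (13.84). The function names and the bisection bracket are illustrative assumptions, not part of the text.

```python
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal c.d.f.


def boundary_power(c, shift, sigma):
    """P{|sigma*Z + shift| <= c} for Z ~ N(0, 1)."""
    return _PHI((c - shift) / sigma) - _PHI((-c - shift) / sigma)


def critical_value(alpha, delta, sigma, tol=1e-12):
    """Solve (13.86) for C by bisection: boundary_power(C, delta, sigma) = alpha."""
    lo, hi = 0.0, delta + 10.0 * sigma  # assumed bracket: power ~0 at lo, ~1 at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if boundary_power(mid, delta, sigma) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_bound(alpha, delta, sigma, delta_prime):
    """Right side of (13.84) when |<g'(theta0), h>| = delta_prime."""
    c = critical_value(alpha, delta, sigma)
    return boundary_power(c, delta_prime, sigma)
```

For instance, `power_bound(0.05, 1.0, 1.0, 0.0)` evaluates the bound at $\delta' = 0$ when $\delta = \sigma_{\theta_0} = 1$; by construction the bound equals $\alpha$ at $\delta' = \delta$ and decreases as $\delta'$ increases toward $\delta$.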


Example 13.5.6 (Normal Problem, Example 13.5.5, continued) Suppose
$X_1,\ldots,X_n$ are i.i.d. $N(\mu, \sigma^2)$ with both parameters unknown, so that $\theta = (\mu, \sigma)$.
Let $g(\theta) = \mu$ and consider testing $|\mu| \ge \delta n^{-1/2}$ versus $|\mu| < \delta n^{-1/2}$. By the
previous theorem, for any test sequence $\phi_n$ with limiting size bounded by $\alpha$ and
any $h$ with $|h| < \delta$,

$$\limsup_{n\to\infty} E_{hn^{-1/2},\sigma}(\phi_n) \le \Phi\left(\frac{C - h}{\sigma}\right) - \Phi\left(\frac{-C - h}{\sigma}\right) , \tag{13.88}$$

where $C = C(\alpha, \delta, \sigma)$ satisfies (13.86). A test whose limiting power achieves this
bound is given by the test $\phi_n^*$ that rejects when

$$n^{1/2}|\bar X_n| \le C(\alpha, \delta, S_n) ,$$

where $S_n^2$ is the (unbiased) sample variance (or any consistent estimator of
$\sigma^2$). On the other hand, the test $\phi_n^{TOST}$ given in Example 13.5.5 is no longer
asymptotically efficient. This test (with $\Delta = \delta n^{-1/2}$) rejects when

$$n^{1/2}|\bar X_n| \le \delta - S_n t_{n-1,1-\alpha}$$

and has power against $(\mu, \sigma) = (hn^{-1/2}, \sigma)$ given by

$$P_{hn^{-1/2},\sigma}\left\{ \frac{-\delta + S_n t_{n-1,1-\alpha} - h}{\sigma} \le Z_n \le \frac{\delta - S_n t_{n-1,1-\alpha} - h}{\sigma} \right\} , \tag{13.89}$$

where

$$Z_n = n^{1/2}(\bar X_n - hn^{-1/2})/\sigma \sim N(0, 1) .$$

Also, $S_n \to \sigma$ in probability and $t_{n-1,1-\alpha} \to z_{1-\alpha}$. By Slutsky's Theorem, (13.89)
converges to

$$P\left\{ -\frac{\delta}{\sigma} + z_{1-\alpha} - \frac{h}{\sigma} \le Z \le \frac{\delta}{\sigma} - z_{1-\alpha} - \frac{h}{\sigma} \right\} , \tag{13.90}$$

where $Z \sim N(0, 1)$. Observe that this last expression is positive only if $\sigma z_{1-\alpha} < \delta$;
otherwise, the limiting power is zero! On the other hand, the limiting optimal
power of $\phi_n^*$ is always positive (and greater than $\alpha$ when $|h| < \delta$). Even when the
limiting power of $\phi_n^{TOST}$ is positive, it is always strictly less than that of $\phi_n^*$.
Note that the limiting expression (13.90) for the power of $\phi_n^{TOST}$ corresponds
exactly to using a TOST test in the limiting experiment $N(h, \sigma^2)$ for testing
$|h| \ge \delta$ versus $|h| < \delta$ with $\sigma$ known based on one observation $X$. In the limit
experiment, the TOST procedure corresponds to the test that rejects if
$|X| \le \delta - \sigma z_{1-\alpha}$ (which can be viewed as a TOST construction because its rejection
region is the intersection of the rejection regions of the two one-sided tests of
$h < \delta$ and $h > -\delta$). But, for this limit experiment, the optimal UMP procedure
of Section 3.7 rejects when $|X| \le C(\alpha, \delta, \sigma)$. In general,

$$C(\alpha, \delta, \sigma) > \delta - \sigma z_{1-\alpha}$$

(Problem 13.54), which shows that the test $\phi_n^*$ of Theorem 13.5.3 is always more
powerful than the asymptotic TOST construction of Theorem 13.5.2. ■
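
To make the comparison in Example 13.5.6 concrete, the following Python sketch evaluates the two limiting powers: the optimal one, from (13.88) with $C$ solved from (13.86), and the TOST one, from (13.90). It is only an illustration; the function names, the bisection bracket, and the parameter values in the usage note are assumptions and do not come from the text.

```python
from statistics import NormalDist

_N = NormalDist()


def interval_power(c, h, sigma):
    """P{|sigma*Z + h| <= c} for Z ~ N(0, 1); zero when the cutoff c <= 0."""
    if c <= 0:
        return 0.0
    return _N.cdf((c - h) / sigma) - _N.cdf((-c - h) / sigma)


def optimal_cutoff(alpha, delta, sigma, tol=1e-12):
    """C(alpha, delta, sigma) from (13.86), found by bisection."""
    lo, hi = 0.0, delta + 10.0 * sigma  # assumed bracket for the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if interval_power(mid, delta, sigma) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tost_cutoff(alpha, delta, sigma):
    """Limiting TOST cutoff delta - sigma * z_{1-alpha}; may be negative."""
    return delta - sigma * _N.inv_cdf(1.0 - alpha)


def limiting_powers(alpha, delta, sigma, h):
    """Return (optimal, TOST) limiting power at mu = h * n**-0.5."""
    return (interval_power(optimal_cutoff(alpha, delta, sigma), h, sigma),
            interval_power(tost_cutoff(alpha, delta, sigma), h, sigma))
```

With $\alpha = 0.05$ and $\delta = \sigma = 1$, the TOST cutoff is negative, so `limiting_powers(0.05, 1.0, 1.0, 0.0)` returns a TOST power of zero while the optimal power stays above $\alpha$. With $\delta = 3$ the two cutoffs are close but the optimal one is still strictly larger.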


13.5.3 Multi-sided Hypotheses 

We now consider the problem of testing $\theta = \theta_0$ versus $\theta \ne \theta_0$ as $\theta$ varies in an
open subset of $\mathbb{R}^k$. Theorem 13.4.1 relates this problem to testing $h = 0$ versus
$h \ne 0$ based on an observation $X$ from the normal model $N(h, I^{-1}(\theta_0))$, where,
as usual, $I(\theta_0)$ is the Fisher Information. For this normal model, no UMP test
exists, and Theorem 13.4.1 does not lead to an asymptotically UMP test sequence
for the original problem. However, we will obtain an optimality result based
on the maximin approach. Indeed, for this limiting normal model, an optimal
maximin test exists, which allows one to construct an asymptotically maximin
test sequence.

In order to have a nondegenerate asymptotically maximin procedure, it is necessary
to consider alternatives at some distance from the null hypothesis, just as
in the finite sample maximin theory. When testing based on $n$ i.i.d. observations,
this distance must shrink with $n$, in order to avoid a degenerate asymptotic theory,
since there will typically exist test sequences whose asymptotic power tends
to one uniformly over alternatives whose distance from $\theta_0$ is fixed. It is convenient
to consider this distance as given by $|I^{1/2}(\theta_0)(\theta - \theta_0)|$, where $|\cdot|$ denotes
the usual Euclidean norm of a vector in $\mathbb{R}^k$. For q.m.d. models, it will be seen
that it is necessary to let this distance shrink at rate $n^{-1/2}$ in order to obtain a
limiting minimum power greater than $\alpha$ and less than 1.

In the following theorem, $c_{k,1-\alpha}$ denotes the $1-\alpha$ quantile of the Chi-squared
distribution with $k$ degrees of freedom.

Theorem 13.5.4 Assume $X_1,\ldots,X_n$ are i.i.d. $P_\theta$, where $\theta$ varies in an open
subset $\Omega$ of $\mathbb{R}^k$. Assume this family is q.m.d. at $\theta_0$ with positive definite Information
matrix $I(\theta_0)$. The problem is to test the null hypothesis $\theta = \theta_0$ against
$\theta \ne \theta_0$. Let $\phi_n = \phi_n(X_1,\ldots,X_n)$ be any sequence of tests such that $E_{\theta_0}(\phi_n) \to \alpha$.
Then, for any $b > 0$,

$$\limsup_{n\to\infty} \inf\{E_{\theta_0 + hn^{-1/2}}(\phi_n) : |I^{1/2}(\theta_0)h| \ge b\} \le P\{\chi_k^2(b^2) > c_{k,1-\alpha}\} , \tag{13.91}$$

where $\chi_k^2(b^2)$ denotes a random variable that has the noncentral Chi-squared
distribution with $k$ degrees of freedom and noncentrality parameter $b^2$.

Proof. Denote by $\beta_n(h)$ the rescaled power function of $\phi_n$, i.e.,

$$\beta_n(h) = E_{\theta_0 + hn^{-1/2}}(\phi_n) .$$

By assumption, $\beta_n(0) \to \alpha$. Denote by $R = R(\alpha, b)$ the right hand side of (13.91).
Now, argue by contradiction; that is, assume for the test sequence $\phi_n$ and some
subsequence $\{n_j\}$,

$$\lim_{n_j\to\infty} \inf\{\beta_{n_j}(h) : |I^{1/2}(\theta_0)h| \ge b\} > R .$$

Then, by Theorem 13.4.1, there exists a further subsequence $n_{j_m}$ such that

$$\beta_{n_{j_m}}(h) \to \beta(h)$$

for every $h$, where $\beta(h)$ corresponds to a level $\alpha$ test of $h = 0$ versus $h \ne 0$ in
the (limiting) experiment consisting of observing an $X$ which is $N(h, I^{-1}(\theta_0))$.
Thus, $\beta(h) > R$ for every $h$ such that $|I^{1/2}(\theta_0)h| \ge b$, which implies

$$\inf\{\beta(h) : |I^{1/2}(\theta_0)h| \ge b\} > R .$$

This is a contradiction, since $R$ is the maximin power for testing $h = 0$ versus
$|I^{1/2}(\theta_0)h| \ge b$ based on $X$ (Problem 8.29). ■

We first illustrate the theorem in the case k = 1. 

Example 13.5.7 (Simple vs Two-sided Alternative) Suppose $X_1,\ldots,X_n$
are i.i.d. $P_\theta$, $\theta \in \mathbb{R}$. Consider testing $\theta = \theta_0$ versus $\theta \ne \theta_0$. Assume the family is
q.m.d. at $\theta_0$. Let $\phi_n$ be any test sequence satisfying $E_{\theta_0}(\phi_n) \to \alpha$. By Theorem
13.5.4 with $d = I^{-1/2}(\theta_0) b$, an upper bound for the limiting maximin power over
the complement of shrinking neighborhoods is given by

$$\limsup_n \inf\{E_{\theta_0 + hn^{-1/2}}(\phi_n) : |h| \ge d\} \le P\{\chi_1^2(I(\theta_0)d^2) > c_{1,1-\alpha}\} .$$

In the one-sided case, an AUMP level $\alpha$ test (13.43) rejects for large values of the
score statistic $Z_n$ given by (13.42). Consider the two-sided version $\phi_{n,2}$ of this
test which rejects when $I^{-1}(\theta_0)Z_n^2 > c_{1,1-\alpha}$. Since $I^{-1}(\theta_0)Z_n^2$ is asymptotically
Chi-squared with one degree of freedom, this test is consistent in level. Moreover,
its power function satisfies, for any $0 < d < D < \infty$,

$$\inf\left[P_{\theta_0 + hn^{-1/2}}\{I^{-1}(\theta_0)Z_n^2 > c_{1,1-\alpha}\} : d \le h \le D\right] \to P\{\chi_1^2(I(\theta_0)d^2) > c_{1,1-\alpha}\} . \tag{13.92}$$

To see why, the convergence (13.44) implies that, under $\theta_n = \theta_0 + h_n n^{-1/2}$ with $h_n \to h$,

$$I^{-1}(\theta_0)Z_n^2 \xrightarrow{d} \chi_1^2(I(\theta_0)h^2) .$$

If (13.92) failed, there would exist $h_n$ satisfying $h_n \to h \in [d, D]$ such that the
limiting power of $\phi_{n,2}$ against $\theta_n$ tends to

$$P\{\chi_1^2(I(\theta_0)h^2) > c_{1,1-\alpha}\} < P\{\chi_1^2(I(\theta_0)d^2) > c_{1,1-\alpha}\} .$$

But, this last inequality is a contradiction since $h \ge d$ and the family of $\chi_1^2(\psi^2)$
with $\psi^2$ varying has monotone likelihood ratio (see Problem 7.4). It is typically
possible to prove the stronger result with $D$ in (13.92) replaced by $\infty$. This
technical issue is the same as encountered in the one-sided case in Section 13.3
when determining whether or not Rao's score test is not only LAUMP but AUMP;
see Theorem 13.3.3 (iv). For an alternative asymptotic optimality approach in
the two-sided case, see Problem 13.55. ■
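
For $k = 1$ the maximin bound in Example 13.5.7 can be evaluated with the normal c.d.f. alone, because $\chi_1^2(\lambda)$ is the distribution of $(Z + \lambda^{1/2})^2$ with $Z \sim N(0,1)$ and $c_{1,1-\alpha} = z_{1-\alpha/2}^2$. The following Python sketch does this; the function name and arguments are illustrative assumptions.

```python
from statistics import NormalDist

_N = NormalDist()


def two_sided_bound(alpha, info, d):
    """P{chi2_1(info * d**2) > c_{1,1-alpha}}, the k = 1 case of (13.91).

    Uses the fact that chi2_1(lam) is the law of (Z + sqrt(lam))**2 with
    Z ~ N(0, 1), and that c_{1,1-alpha} = z_{1-alpha/2}**2.
    """
    z = _N.inv_cdf(1.0 - alpha / 2.0)
    b = (info ** 0.5) * abs(d)  # b = I^{1/2}(theta0) * |d|
    return 1.0 - _N.cdf(z - b) + _N.cdf(-z - b)
```

At $d = 0$ the function returns $\alpha$, and it increases with both the Fisher Information `info` and the distance `d`, which is the monotonicity used in the contradiction argument above.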

By a similar argument, we can prove the following optimality result for Rao’s 
test in the general k multi-sided testing problem. Analogous results hold for both 
the Wald and likelihood ratio tests (Problem 13.57). 

Theorem 13.5.5 Assume the conditions of Theorem 13.5.4. For testing $\theta = \theta_0$
versus $\theta \ne \theta_0$, consider the test $\phi_n^*$ that rejects when $Z_n^T I^{-1}(\theta_0) Z_n > c_{k,1-\alpha}$.
Then, $E_{\theta_0}(\phi_n^*) \to \alpha$ and for any $b$ and $B$ satisfying $0 < b < B < \infty$,

$$\inf\{E_{\theta_0 + hn^{-1/2}}(\phi_n^*) : b \le |I^{1/2}(\theta_0)h| \le B\} \to P\{\chi_k^2(b^2) > c_{k,1-\alpha}\} . \tag{13.93}$$

Proof. First suppose $h_n \to h$ with $h$ satisfying $|I^{1/2}(\theta_0)h| \ge b$. By the Continuous
Mapping Theorem, under $\theta_0 + h_n n^{-1/2}$, Corollary 12.4.1 implies that

$$Z_n^T I^{-1}(\theta_0) Z_n \xrightarrow{d} \chi_k^2(|I^{1/2}(\theta_0)h|^2) .$$

Hence, the limiting power of $\phi_n^*$ against such a sequence is

$$P\{\chi_k^2(|I^{1/2}(\theta_0)h|^2) > c_{k,1-\alpha}\} \ge P\{\chi_k^2(b^2) > c_{k,1-\alpha}\} , \tag{13.94}$$

where the last inequality follows since the family of noncentral chi-squared distributions
with fixed degrees of freedom and varying noncentrality parameter has
monotone likelihood ratio. Now, if the result (13.93) were false, there would exist
a sequence $h_n$ satisfying $b \le |I^{1/2}(\theta_0)h_n| \le B$ and such that the limiting power of
$\phi_n^*$ under $h_n$ is less than the right hand side of (13.94). But, $h_n$ lies in a compact
set, so we can extract a further subsequence $h_{n_j}$ (if necessary) so that $h_{n_j}$ converges.
Applying the argument leading to (13.94) to such a subsequence results
in a contradiction. ■

We will later apply these results to obtain some asymptotically maximin tests 
of goodness of fit in Sections 14.3 and 14.4. 

Note that the construction of asymptotically optimal tests in the multi-sided
case depends on the existence of an optimal test for testing the mean vector
$h = 0$ when $X \sim N(h, I^{-1}(\theta_0))$ and $I^{-1}(\theta_0)$ is a known nonsingular covariance
matrix. For this problem, if the alternatives are specified by $|I^{1/2}(\theta_0)h| \ge b$,
then the maximin test rejects for large values of $X^T I(\theta_0) X$. But, the maximin
optimality of this test need not hold if the alternative parameter space is specified
differently; see Problem 8.30. Moreover, if $C$ is any closed, convex set in $\mathbb{R}^k$, then
the test that accepts if and only if $X \in C$ is admissible; see Problem 6.39. Thus,
the optimality of the maximin test is not so compelling, particularly when $k > 1$.


13.6 Applications to Nonparametric Models 

13.6.1 Nonparametric Mean 

Let $X_1,\ldots,X_n$ be i.i.d. with c.d.f. $F$, mean $\mu(F)$ and variance $\sigma^2(F)$. Assume
$F \in \mathbf{F}$, where $\mathbf{F}$ satisfies (11.77). We now would like to derive an optimality
property of the t-test for the mean in a nonparametric setting. Theorem 11.4.5
implies that the power of the t-test is bounded away from $\alpha$ for distributions $F$
whose standardized mean $n^{1/2}\mu(F)/\sigma(F)$ is bounded away from 0. It is then of
interest to measure a test sequence by its maximin power over such alternatives,
with the goal of finding the test that asymptotically maximizes the minimum
power over such alternatives. Consider testing $\mu(F) = 0$ against the alternatives
$\mu(F)/\sigma(F) \ge \delta/n^{1/2}$. By Theorem 11.4.5, the limiting minimum power of the
t-test is $1 - \Phi(z_{1-\alpha} - \delta)$. We now show that this is indeed the optimal limiting
maximin power in a nonparametric setting.

If the unknown family of distributions $\mathbf{F}$ contains the family $N(\theta, 1)$ for $\theta \ge 0$,
then an optimality result is easy to obtain. Indeed, for any sequence of test
functions $\phi_n = \phi_n(X_1,\ldots,X_n)$ which satisfies $E_F(\phi_n) \to \alpha$ for any $F \in \mathbf{F}$ with
mean 0, we have

$$\limsup_n \inf_{\{F \in \mathbf{F},\ \mu(F)/\sigma(F) \ge \delta n^{-1/2}\}} E_F(\phi_n) \le \limsup_n E_{F = N(\delta n^{-1/2}, 1)}(\phi_n) \le 1 - \Phi(z_{1-\alpha} - \delta) ,$$

since the right hand side is the optimal limiting power for testing $\theta = 0$ versus
$\theta = \delta/n^{1/2}$ in the normal location model $N(\theta, 1)$. Hence, the t-test is asymptotically
maximin since its limiting minimum power attains this bound.

If the family of distributions $\mathbf{F}$ does not contain the normal distributions, the
above argument does not work. For example, suppose we consider distributions
supported on $[-1, 1]$. Then, we can still obtain an optimality result for the t-test,
as long as $\mathbf{F}$ satisfies (11.77). To this end, let $\mathbf{F}_0$ denote the family of all distributions
on $[-1, 1]$. Let $\phi_n$ be any test sequence satisfying $E_F(\phi_n) \to \alpha$ if $F \in \mathbf{F}_0$
and $\mu(F) = 0$. Fix any such $F$ with $\mu(F) = 0$ and $\sigma(F) > 0$. The smallest power
over a large class of alternatives can always be bounded above by the smallest
power over a smaller class. If the smaller class is chosen appropriately, the testing
problem for the smaller model (which will be a parametric model that we
have previously studied) will have relevance for the larger class (the nonparametric
model we would like to study). So, introduce the parametric submodel with
density

$$p_\theta(x) = \exp(\theta x - C(\theta)) \tag{13.95}$$

with respect to $F$. This is a one-parameter exponential family, and so the
conditions of Theorem 13.3.2 are satisfied. Let

$$\mu_\theta = \int x\, p_\theta(x)\, dF(x)$$

be the mean of $p_\theta$ and let $\sigma_\theta^2$ be its variance. Since $\mu(F) = 0$, $\mu_0 = 0$. In addition,
$\mu_\theta = C'(\theta)$ and $\sigma_\theta^2 = C''(\theta)$, so that $C'(0) = 0$ and $C''(0) = \sigma^2(F) > 0$. Then,

$$\frac{\mu_\theta}{\sigma_\theta} = \frac{C'(\theta)}{[C''(\theta)]^{1/2}} = \frac{\theta C''(0) + o(\theta)}{[C''(\theta)]^{1/2}} = \theta\sigma(F) + o(\theta)$$

as $\theta \to 0$. Also, for this model, $I(\theta) = C''(\theta)$, so that $I(0) = \sigma^2(F)$. It is also
easy to check that the family (13.95) satisfies (11.77), at least for small enough
$\theta$ (Problem 13.58).
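
The expansion $\mu_\theta/\sigma_\theta = \theta\sigma(F) + o(\theta)$ can be checked numerically for a discrete $F$. The following Python sketch builds the tilted family (13.95) for a hypothetical five-point distribution on $[-1, 1]$ with mean 0; the support points, the probabilities, and the function name `tilt` are assumptions chosen only for illustration.

```python
import math


def tilt(points, probs, theta):
    """Mean and standard deviation of the tilted law p_theta dF in (13.95).

    F is taken to be discrete with the given support points and probabilities;
    C(theta) = log E_F[exp(theta * X)] normalizes the tilted weights.
    """
    weights = [p * math.exp(theta * x) for x, p in zip(points, probs)]
    total = sum(weights)  # equals exp(C(theta))
    tilted = [w / total for w in weights]
    mean = sum(x * q for x, q in zip(points, tilted))
    var = sum((x - mean) ** 2 * q for x, q in zip(points, tilted))
    return mean, math.sqrt(var)


# Hypothetical F on [-1, 1] with mean 0 (an assumption for illustration).
POINTS = [-1.0, -0.5, 0.0, 0.5, 1.0]
PROBS = [0.2, 0.2, 0.2, 0.2, 0.2]
SIGMA_F = tilt(POINTS, PROBS, 0.0)[1]  # sigma(F), the untilted case
```

For small `theta`, the ratio `mean / sd` returned by `tilt(POINTS, PROBS, theta)` divided by `theta` should be close to `SIGMA_F`, in line with the display above.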

With $\delta$ fixed, let $\theta_n$ be any fixed sequence such that $n^{1/2}\theta_n > \delta/\sigma(F)$ and
$n^{1/2}\theta_n \to \delta/\sigma(F)$. Then,

$$n^{1/2}\mu_{\theta_n}/\sigma_{\theta_n} = n^{1/2}\theta_n\sigma(F) + o(1)$$

as $\theta_n \to 0$. Thus, $n^{1/2}\mu_{\theta_n}/\sigma_{\theta_n} \ge \delta$ for all sufficiently large $n$. So, the problem
of testing $\theta = 0$ versus $\theta = \theta_n$ is relevant to the nonparametric mean problem
because $\theta = 0$ corresponds to a distribution in the null hypothesis parameter
space while $\theta = \theta_n$ corresponds to a distribution in the alternative hypothesis
parameter space (sequence). Hence, for any test sequence $\phi_n$,

$$\limsup_n \inf_{F \in \mathbf{F}_0,\ n^{1/2}\mu(F)/\sigma(F) \ge \delta} E_F(\phi_n) \le \limsup_n E_{\theta_n}(\phi_n) .$$

The right hand side is bounded above by the optimal limiting power for testing
$\theta = 0$ versus $\theta = \theta_n$. The limiting value was obtained in Theorem 13.3.2 (with
$h = \delta/\sigma(F)$) and is equal to

$$1 - \Phi(z_{1-\alpha} - h\sigma(F)) = 1 - \Phi(z_{1-\alpha} - \delta) .$$

Hence, we have shown that

$$\limsup_n \inf_{F \in \mathbf{F}_0,\ n^{1/2}\mu(F)/\sigma(F) \ge \delta} E_F(\phi_n) \le 1 - \Phi(z_{1-\alpha} - \delta) .$$

But, the t-test attains the right hand side, and so is asymptotically maximin.

Of course, one can obtain a bound using other parametric submodels. The
family $p_\theta$ chosen above certainly works in that it yields an optimality result for
the t-test. To gain some insight into why this family works, let us consider the
more general family of densities

$$p_{T,\theta}(x) = \exp\left(\theta T(x) - C_T(\theta)\right)$$

with respect to $F$. This assumes that the function $T(x)$ is bounded on $[-1, 1]$, or
at least that $T(X)$ has a moment generating function if $X$ has distribution $F$.
Let

$$\mu_{T,\theta} = \int x\, p_{T,\theta}(x)\, dF(x)$$

and $\sigma^2_{T,\theta}$ be the variance of $p_{T,\theta}$. The functions $\mu_{T,\theta}$ and $\sigma_{T,\theta}$ are infinitely
differentiable in $\theta$. Then,

$$\frac{\partial \mu_{T,\theta}}{\partial\theta} = \int x[T(x) - C_T'(\theta)]\, p_{T,\theta}(x)\, dF(x) ,$$

so that

$$\frac{\partial \mu_{T,\theta}}{\partial\theta}\Big|_{\theta = 0} = \int x[T(x) - C_T'(0)]\, dF(x) = \mathrm{Cov}_F[X, T(X)] .$$

Then,

$$\mu_{T,\theta} = \theta\, \mathrm{Cov}_F[X, T(X)] + o(\theta)$$

and

$$\sigma_{T,\theta} = \sigma(F) + O(\theta)$$

as $\theta \to 0$. Hence,

$$\frac{\mu_{T,\theta}}{\sigma_{T,\theta}} = \frac{\theta\, \mathrm{Cov}_F[X, T(X)]}{\sigma(F)} + o(\theta)$$

as $\theta \to 0$. Assume $\mathrm{Cov}_F[X, T(X)] \ne 0$, in which case we may assume without
loss of generality that it is positive (or replace $T$ with $-T$). Let $\theta_n$ be any fixed
sequence with $n^{1/2}\theta_n > \delta\sigma(F)/\mathrm{Cov}_F[X, T(X)]$ and

$$n^{1/2}\theta_n \to \delta\sigma(F)/\mathrm{Cov}_F[X, T(X)] .$$

Then,

$$n^{1/2}\frac{\mu_{T,\theta_n}}{\sigma_{T,\theta_n}} = n^{1/2}\theta_n \frac{\mathrm{Cov}_F[X, T(X)]}{\sigma(F)} + o(1) .$$

So, $n^{1/2}\mu_{T,\theta_n}/\sigma_{T,\theta_n} \ge \delta$ for all sufficiently large $n$. Thus, for any test sequence
$\phi_n$,

$$\limsup_n \inf_{F \in \mathbf{F}_0,\ n^{1/2}\mu(F)/\sigma(F) \ge \delta} E_F(\phi_n) \le \limsup_n E_{T,\theta_n}(\phi_n) ,$$

where $E_{T,\theta_n}$ denotes expectation with respect to $p_{T,\theta_n}$. Note that, for this model,
the Information at $\theta = 0$ satisfies

$$I_T(0) = C_T''(0) = \mathrm{Var}_F[T(X)] .$$

The best limiting power among asymptotically level $\alpha$ tests of $\theta = 0$ versus $\theta = \theta_n$
was obtained in Theorem 13.3.2 (with $h = \delta\sigma(F)/\mathrm{Cov}_F[X, T(X)]$) as

$$1 - \Phi(z_{1-\alpha} - hI_T^{1/2}(0)) = 1 - \Phi\left(z_{1-\alpha} - \delta\sigma(F)I_T^{1/2}(0)/\mathrm{Cov}_F[X, T(X)]\right) .$$

This reduces to the previous bound in the case $T(X) = X$. The sharpest possible
result is obtained by choosing $T$ to minimize the right hand side, which is
equivalent to maximizing

$$\frac{\mathrm{Cov}_F[X, T(X)]}{\{\mathrm{Var}_F(X)\mathrm{Var}_F[T(X)]\}^{1/2}} .$$

By the Cauchy-Schwarz inequality, this is bounded above by 1, and the resulting
value of 1 is attained when $T(X) = X$.
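
The role of the correlation between $X$ and $T(X)$ can be seen numerically. The bound above equals $1 - \Phi(z_{1-\alpha} - \delta/\rho)$, where $\rho$ is the correlation under $F$ of $X$ and $T(X)$, so any $T$ with $\rho < 1$ gives a larger (weaker) upper bound than $T(X) = X$. The following Python sketch computes the bound for a discrete $F$; the function name, the example distribution, and the choice $T(x) = x^3$ in the usage note are illustrative assumptions.

```python
import math
from statistics import NormalDist

_N = NormalDist()


def submodel_bound(points, probs, T, alpha, delta):
    """Limiting power bound 1 - Phi(z_{1-alpha} - delta / rho) for the T-submodel.

    rho = Cov_F[X, T(X)] / (sd_F(X) * sd_F(T(X))) is assumed to be positive.
    """
    ex = sum(x * p for x, p in zip(points, probs))
    et = sum(T(x) * p for x, p in zip(points, probs))
    cov = sum((x - ex) * (T(x) - et) * p for x, p in zip(points, probs))
    var_x = sum((x - ex) ** 2 * p for x, p in zip(points, probs))
    var_t = sum((T(x) - et) ** 2 * p for x, p in zip(points, probs))
    rho = cov / math.sqrt(var_x * var_t)
    return 1.0 - _N.cdf(_N.inv_cdf(1.0 - alpha) - delta / rho)
```

For example, with the uniform distribution on `[-1.0, -0.5, 0.0, 0.5, 1.0]`, the call with `T = lambda x: x` reproduces $1 - \Phi(z_{1-\alpha} - \delta)$, while `T = lambda x: x ** 3` yields a strictly larger value.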

Thus, in some sense, the model with $T(X) = X$ is least favorable in that it is
the parametric submodel in which it is hardest to achieve high (limiting) power. The idea of
using a parametric submodel to obtain efficiency results in nonparametric models
dates back to Stein (1956b).


13.6.2 Nonparametric Testing of Functionals 

Suppose $X_1,\ldots,X_n$ are i.i.d. $P \in \mathcal{P}$. In this section, the family $\mathcal{P}$ is a nonparametric
family. Specifically, we would like to consider problems where we do
not assume much or anything about $P$. Thus, $\mathcal{P}$ could be the family of all distributions
on some sample space $S$, but it might be restricted by moment or
smoothness conditions, in which case $\mathcal{P}$ is still quite large.

Let $\theta(\cdot)$ be a statistical functional; that is, $\theta(P)$ is a real-valued function of
$P$, defined for $P \in \mathcal{P}$. For example, if $P$ is a distribution on $\mathbb{R}$, $\theta(P)$ could be
the mean of $P$, or the variance of $P$. In such cases, $\mathcal{P}$ could be the set of all
distributions with finite variance. Or, if $P$ is a distribution on $\mathbb{R}^2$, $\theta(P)$ might
be the correlation of $P$, defined on the set $\mathcal{P}$ of all distributions whose marginals
have a finite nonzero variance.

We wish to test the null hypothesis $\theta(P) \le 0$ against $\theta(P) > 0$. Fix $P$ with
$\theta(P) = 0$. In order to assess the power of a test at some distribution $Q$ near $P$,
we will consider parametric submodels that contain $P$. The basic idea is that the
power attainable in the full nonparametric model can be no greater than for any
submodel.

Let $L^2(P)$ denote the space of (equivalence classes of) functions $u$ which are
square integrable with respect to $P$. The inner product is given by

$$\langle u, v\rangle_P = \int u(x)v(x)\, dP(x) ,$$

and $|u|_P^2 = \langle u, u\rangle_P$. Also, let $L_0^2(P)$ denote the subset of $u \in L^2(P)$ satisfying
$\int u(x)\, dP(x) = 0$. By Problem 12.6, if $u \in L_0^2(P)$, we can construct a one-dimensional
q.m.d. family $P_{u,t}$ indexed by $t$ in some neighborhood of 0, such
that $P_{u,0} = P$ and the score function at $t = 0$ is $u$. For example, if $u$ is bounded
and $|t| < [\sup_x |u(x)|]^{-1}$, then we can take $P_{u,t}$ to be the distribution with density
with respect to $P$ given by

$$\frac{dP_{u,t}}{dP}(x) = 1 + tu(x) . \tag{13.96}$$

(Note that $P_{u,t} \in \mathcal{P}$ if $\mathcal{P}$ is the set of all probabilities on $S$, but if there are
restrictions on $\mathcal{P}$, this construction may not work.)

In order to test $\theta(P_{u,t})$ along such a parametric submodel, we assume that $\theta(\cdot)$
is differentiable in the sense

$$\frac{\theta(P_{u,t}) - \theta(P)}{t} \to \langle u, \dot\theta_P\rangle_P \quad \text{as } t \to 0 , \tag{13.97}$$

for some function $\dot\theta_P \in L^2(P)$. Evidently, this condition implies that, as a real-valued
function of the real variable $t$, $\theta(P_{u,t})$ is differentiable at $t = 0$.⁴ Note that,
if $\dot\theta_P$ satisfies (13.97), then so does $\dot\theta_P + c$ for any constant $c$; we will henceforth
assume $\int \dot\theta_P(x)\, dP(x) = 0$.


Example 13.6.1 (Linear Functionals) A statistical functional is linear if it
can be represented as

$$\theta(P) = \int f(x)\, dP(x) \tag{13.98}$$

for some function $f \in L^2(P)$. In this case, if $P_{u,t}$ is given by (13.96), then

$$\frac{\theta(P_{u,t}) - \theta(P)}{t} = \langle u, f\rangle_P$$

with no error term; that is,

$$\dot\theta_P(x) = f(x) - \int f(x)\, dP . \tag{13.99}$$

Even if $P_{u,t}$ is not specifically of the form (13.96), then it can be shown that $\theta(P)$
is differentiable in the sense of (13.97) with $\dot\theta_P$ given by (13.99) if

$$\sup_{P \in \mathcal{P}} E_P[f^2(X)] < \infty ;$$

see Bickel et al. (1993, p. 457-458). In particular, if $f$ is a bounded function on a
set $S$ and $\mathcal{P}$ is the set of all probabilities on $S$, then $\theta(\cdot)$ is differentiable in the
sense of (13.97). ■
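
The claim that the difference quotient of a linear functional along (13.96) equals $\langle u, f\rangle_P$ with no error term can be verified directly for a discrete $P$. The following Python sketch does so; the helper names and the three-point example in the usage note are assumptions made for illustration only.

```python
def theta_linear(f, points, probs):
    """theta(P) = integral of f dP for a discrete P."""
    return sum(f(x) * p for x, p in zip(points, probs))


def perturb(u, points, probs, t):
    """Probabilities of P_{u,t}, whose density with respect to P is 1 + t*u(x) (13.96).

    Assumes u has mean 0 under P and |t| < 1 / max|u(x)|, so the result is
    a probability vector.
    """
    return [p * (1.0 + t * u(x)) for x, p in zip(points, probs)]


def inner(u, v, points, probs):
    """Inner product <u, v>_P = integral of u*v dP."""
    return sum(u(x) * v(x) * p for x, p in zip(points, probs))
```

For instance, take $P$ with probabilities `[0.25, 0.5, 0.25]` on `[-1.0, 0.0, 1.0]`, score $u(x) = x$, and $f(x) = x + x^2$. For any admissible `t`, the quotient `(theta_linear(f, pts, perturb(u, pts, pr, t)) - theta_linear(f, pts, pr)) / t` agrees with `inner(u, f, pts, pr)` up to rounding.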


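Next, for testing $\theta(P) \le 0$ against $\theta(P) > 0$, we obtain an upper bound for
the limiting local power function along a one-dimensional q.m.d. submodel. Note
that, under (13.97),

$$\theta(P_{u,t}) = \theta(P) + t\langle \dot\theta_P, u\rangle_P + o(t) \quad \text{as } t \to 0 , \tag{13.100}$$

which implies $\theta(P_{u,t}) > 0$ for all small $t > 0$ if $\langle \dot\theta_P, u\rangle_P > 0$.

By Lemma 13.3.1 (ii), if $h > 0$ and $\langle \dot\theta_P, u\rangle_P > 0$, then (Problem 13.59)

$$\limsup_n E_{P_{u,hn^{-1/2}}}(\phi_n) \le 1 - \Phi(z_{1-\alpha} - h|u|_P) . \tag{13.101}$$

Fix $\delta > 0$ and let

$$h = h(u, \delta) = \frac{\delta}{\langle \dot\theta_P, u\rangle_P} ;$$

then, $n^{1/2}\theta(P_{u,h(u,\delta)n^{-1/2}}) \to \delta$. The bound (13.101) at $h(u, \delta)n^{-1/2}$ becomes

$$\limsup_n E_{P_{u,h(u,\delta)n^{-1/2}}}(\phi_n) \le 1 - \Phi\left(z_{1-\alpha} - \frac{\delta|u|_P}{\langle \dot\theta_P, u\rangle_P}\right) . \tag{13.102}$$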

⁴ The condition (13.97) further asserts that, as a function of $u$, the limiting value on
the right side of (13.97) is linear in $u$ as $u$ varies in $L_0^2(P)$. In fact, the Riesz representation
theorem (see Theorem 6.4.1 of Dudley (1989)) asserts that any linear function of $u$ must
be of the form $\langle u, \dot\theta\rangle_P$ for some $\dot\theta$.

As $u$ varies, the bound is smallest when $|u|_P/\langle \dot\theta_P, u\rangle_P$ is minimized. But, by
Cauchy-Schwarz,

$$\frac{|u|_P}{\langle \dot\theta_P, u\rangle_P} \ge \frac{1}{|\dot\theta_P|_P} ,$$

and equality occurs when $u = \dot\theta_P$. Note that, when $u = \dot\theta_P$, the bound (13.102)
becomes

$$1 - \Phi\left(z_{1-\alpha} - \frac{\delta}{|\dot\theta_P|_P}\right) . \tag{13.103}$$

Moreover, taking $u = \dot\theta_P$ corresponds to the least favorable family (generalizing
the results of the previous subsection for the mean). Actually, we will obtain a
stronger result which will allow us to construct locally AUMP tests. First, we
obtain an upper bound which is smaller than (13.101) and is generally attainable
for all $u$.


Theorem 13.6.1 Let $X_1,\ldots,X_n$ be i.i.d. $P \in \mathcal{P}$, where $\mathcal{P}$ is the set of all
probabilities on space $S$ (endowed with a $\sigma$-field). Assume $\theta(\cdot)$ is differentiable
in the sense (13.97). Fix $P$ with $\theta(P) = 0$, $u \in L_0^2(P)$, and let $\{P_{u,t}\}$ denote a
q.m.d. submodel, defined for $t$ in some neighborhood of 0 with $P_{u,0} = P$ and score
function $u$. Let $\phi_n = \phi_n(X_1,\ldots,X_n)$ be a sequence of level $\alpha$ tests of $\theta(P) \le 0$.
If $\langle \dot\theta_P, u\rangle_P > 0$ and $h > 0$, then,

$$\limsup_n E_{P_{u,hn^{-1/2}}}(\phi_n) \le 1 - \Phi\left(z_{1-\alpha} - h\frac{\langle \dot\theta_P, u\rangle_P}{|\dot\theta_P|_P}\right) . \tag{13.104}$$

Proof. Without loss of generality, assume $|u|_P^2 = 1$. Let $v = \dot\theta_P - \langle \dot\theta_P, u\rangle_P u$.
Note that $v \in L_0^2(P)$, $\langle u, v\rangle_P = 0$, and

$$\langle v, v\rangle_P = |\dot\theta_P|_P^2 - \langle \dot\theta_P, u\rangle_P^2 .$$

Consider a two-dimensional parametric submodel $P_{u,v,t_1,t_2}$ indexed by $(t_1, t_2)$
in some neighborhood of the origin in $\mathbb{R}^2$ such that the score function at
$(t_1, t_2) = (0, 0)$ is $(u, v)^T$. (See Problem 12.7 for a construction.) The experiments
$P^n_{u,v,h_1n^{-1/2},h_2n^{-1/2}}$ converge to a normal experiment where you
observe $(Z_1, Z_2)^T$ with mean $E(Z_i) = h_i$, $\mathrm{Var}(Z_1) = 1$, $\mathrm{Var}(Z_2) = |v|_P^{-2}$ and
$\mathrm{Cov}(Z_1, Z_2) = 0$ (since $\langle u, v\rangle_P = 0$).

Fix $h_1$ and $h_2$ and let $t_i = th_i$. Then, $h_1u + h_2v$ is the score function for the
family $P_{u,v,h_1t,h_2t}$ indexed by $t$. Moreover,

$$\theta(P_{u,v,th_1,th_2}) - \theta(P) = t\langle h_1u + h_2v, \dot\theta_P\rangle_P + o(t) .$$

So, if $\langle h_1u + h_2v, \dot\theta_P\rangle_P < 0$, we have

$$\limsup_n E_{P_{u,v,h_1n^{-1/2},h_2n^{-1/2}}}(\phi_n) \le \alpha .$$

Therefore, by Theorem 13.4.1, the local limiting power of $\phi_n$ along any
subsequence can be bounded above by the power of $\phi = \phi(Z_1, Z_2)$, where

$$E_{h_1,h_2}(\phi) \le \alpha \quad \text{if } \langle h_1u + h_2v, \dot\theta_P\rangle_P < 0 ,$$

and by continuity the result holds if $\langle h_1u + h_2v, \dot\theta_P\rangle_P = 0$ as well. But, the UMP
level $\alpha$ test for testing

$$h_1\langle u, \dot\theta_P\rangle_P + h_2\langle v, \dot\theta_P\rangle_P \le 0$$

rejects if

$$Z_1\langle u, \dot\theta_P\rangle_P + Z_2\langle v, \dot\theta_P\rangle_P > z_{1-\alpha}\sqrt{\langle u, \dot\theta_P\rangle_P^2 + \langle v, \dot\theta_P\rangle_P^2 |v|_P^{-2}} = z_{1-\alpha}|\dot\theta_P|_P ,$$

which has power with $h_1 = h$ and $h_2 = 0$ given by the right side of (13.104). ■
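
The improvement of (13.104) over (13.101) is an instance of the Cauchy-Schwarz inequality $\langle \dot\theta_P, u\rangle_P \le |\dot\theta_P|_P |u|_P$. The following Python sketch evaluates both bounds for a discrete $P$, with $u$ and $\dot\theta_P$ given as lists of values on the support; the function names and this representation are illustrative assumptions.

```python
import math
from statistics import NormalDist

_N = NormalDist()


def inner(u, v, probs):
    """<u, v>_P for functions given as value lists on the support of a discrete P."""
    return sum(a * b * p for a, b, p in zip(u, v, probs))


def bounds(u, theta_dot, probs, alpha, h):
    """Return (bound (13.101), bound (13.104)) for score u and derivative theta_dot."""
    z = _N.inv_cdf(1.0 - alpha)
    norm_u = math.sqrt(inner(u, u, probs))
    norm_td = math.sqrt(inner(theta_dot, theta_dot, probs))
    crude = 1.0 - _N.cdf(z - h * norm_u)
    sharp = 1.0 - _N.cdf(z - h * inner(theta_dot, u, probs) / norm_td)
    return crude, sharp
```

The two values coincide when `u` is proportional to `theta_dot` with a positive constant (the least favorable direction); otherwise the second is strictly smaller.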

Remark 13.6.1 The tests $\phi_n$ need not be exact level $\alpha$. All that is required
is that $\limsup_n E_{P_{u,hn^{-1/2}}}(\phi_n) \le \alpha$ if $h$ has the opposite sign of $\langle \dot\theta_P, u\rangle_P$. This
must hold for $u$ in the statement of (13.104) as well as any linear combination of
$u$ and $\dot\theta_P$.

The result and the proof apply even if $\mathcal{P}$ is not the set of all probabilities
on $S$. What is required is that the two-dimensional model $P_{u,v,t_1,t_2}$ used in the
proof also belongs to $\mathcal{P}$. Also, the differentiability condition
need only hold for submodels $P_{u,v,h_1t,h_2t}$. For semiparametric models, the result
needs to be modified, but a similar result holds; see Theorem 25.44 of van der
Vaart (1998). ■

Next, we consider tests whose power attains the bound (13.104). 

Example 13.6.2 (Linear Functionals, continued) Let $\hat P_n$ be the empirical
measure, i.e., $\hat P_n\{E\}$ is the proportion of observations that fall in $E$. Then, tests
of $\theta(P)$ can be based on $\theta(\hat P_n) = n^{-1}\sum_i f(X_i)$. Under $\theta(P) = 0$,

$$n^{1/2}\theta(\hat P_n) \xrightarrow{d} N(0, |f|_P^2) .$$

Since $|f|_P$ is unknown, consider the test that rejects when $n^{1/2}\theta(\hat P_n)/S_n > z_{1-\alpha}$,
where

1 n 

4 = 1 

Under $P$, $S_n^2 \to |f|_P^2$ in probability; by contiguity, this holds under $P^n_{u,hn^{-1/2}}$ as well. By
Example 12.3.8, under $P^n_{u,hn^{-1/2}}$,

$$n^{1/2}\theta(\hat P_n) \xrightarrow{d} N(h\langle f, u\rangle_P, |f|_P^2) .$$

By Slutsky's Theorem, under $P^n_{u,hn^{-1/2}}$,

$$n^{1/2}\theta(\hat P_n)/S_n \xrightarrow{d} N\left(\frac{h\langle f, u\rangle_P}{|f|_P}, 1\right) .$$

Therefore, the limiting power of the above test against $P^n_{u,hn^{-1/2}}$ is the upper
bound (13.104). Moreover, the convergence to the limiting power is uniform in $h$
for $0 \le h \le c$ and any $c > 0$ (Problem 13.61). The resulting test is locally AUMP
against all such alternatives. For example, the result applies to one-sided tests
of $\theta(P) = P\{E\}$, and tests based on the empirical measure are asymptotically
LAUMP. ■

Example 13.6.3 (Variance Functional) Suppose $P$ is a distribution on $\mathbb{R}$,
and $\mathcal{P}$ is the set of all distributions with a uniformly bounded fourth moment. Let
$\sigma^2(P)$ denote the variance of $P$, and $\mu(P)$ denote the mean of $P$. The problem
is to test $\sigma^2(P) \le \sigma_0^2$. Let $\theta(P) = \sigma^2(P) - \sigma_0^2$. Then, the conditions of Theorem
13.6.1 hold with

$$\dot\theta_P(x) = [x - \mu(P)]^2 - \sigma^2(P) ,$$

and the test that rejects when $n^{1/2}[\sigma^2(\hat P_n) - \sigma_0^2]/S_n > z_{1-\alpha}$ attains the bound
(13.104), where $S_n^2$ is a consistent estimator of the variance of $[X_i - \mu(P)]^2$, such
as

$$S_n^2 = n^{-1}\sum_i (X_i - \bar X_n)^4 - \sigma^4(\hat P_n) .$$

The details are left to Problem 13.63. ■
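
The test of Example 13.6.3 is straightforward to compute from data. The following Python sketch is one possible implementation under stated assumptions: $\sigma^2(\hat P_n)$ is taken to be the plug-in variance with divisor $n$, the data are assumed non-degenerate so that $S_n > 0$, and the function name and return convention are hypothetical.

```python
import math
from statistics import NormalDist


def variance_test(xs, sigma0_sq, alpha=0.05):
    """One-sided test of sigma^2(P) <= sigma0_sq based on the plug-in variance.

    Returns (statistic, reject), where the statistic is
    n**0.5 * (sigma_hat_sq - sigma0_sq) / S_n and S_n**2 is the plug-in
    estimator n**-1 * sum((x - mean)**4) - sigma_hat_sq**2.
    Assumes the sample is not constant, so that S_n > 0.
    """
    n = len(xs)
    mean = sum(xs) / n
    sigma_hat_sq = sum((x - mean) ** 2 for x in xs) / n
    fourth = sum((x - mean) ** 4 for x in xs) / n
    s_n = math.sqrt(fourth - sigma_hat_sq ** 2)
    stat = math.sqrt(n) * (sigma_hat_sq - sigma0_sq) / s_n
    return stat, stat > NormalDist().inv_cdf(1.0 - alpha)
```

The rejection rule compares the studentized statistic with the normal quantile $z_{1-\alpha}$, as in the example; only the data `xs`, the null value `sigma0_sq`, and the level `alpha` are needed.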

In general, consider tests of $\theta(P)$ based on $\theta(\hat P_n)$. This implicitly assumes
$\theta(\cdot)$ is defined for empirical measures. Suppose $\theta(\hat P_n)$ is an asymptotically linear
statistic in the sense that

$$n^{1/2}[\theta(\hat P_n) - \theta(P)] = n^{1/2}\int \dot\theta_P\, d(\hat P_n - P) + o_P(1) . \tag{13.105}$$

This can be verified directly in examples where $\theta(\hat P_n)$ is a smooth function of
sample means, such as the previous example. Otherwise, $\theta$ must be differentiable
in an appropriate sense, but such an approach is beyond the scope of the treatment
here; see Serfling (1980, Chapter 10) or van der Vaart and Wellner (1996,
Section 3.9). Note that (13.105) implies that, under $P$,

$$n^{1/2}[\theta(\hat P_n) - \theta(P)] \xrightarrow{d} N(0, |\dot\theta_P|_P^2) .$$

In order to construct an optimal test, it is necessary to construct a consistent
estimator of $|\dot\theta_P|_P$. Assuming $S_n$ is such a consistent estimator, the test that
rejects for large $n^{1/2}\theta(\hat P_n)/S_n$ is asymptotically LAUMP, by the same argument
used in Example 13.6.2. General approaches for constructing an estimator of the
asymptotic variance of $n^{1/2}\theta(\hat P_n)$, as well as a means of estimating its sampling
distribution, are provided by bootstrap resampling and subsampling, which will
be discussed in Chapter 15.


13.7 Problems 

Section 13.1 

Problem 13.1 (i). Let $P_i$ have density $p_i$ with respect to a dominating measure
$\mu$. Show that $\|P_1 - P_0\|_1$ defined by $\int |p_1 - p_0|\, d\mu$ is independent of the choice of
$\mu$ and is a metric.

(ii). Show the Hellinger distance defined in (13.12) is also independent of $\mu$ and
is a metric.

Problem 13.2 Show that $\|P_1 - P_0\|_1$ can also be computed as

$$2 \sup_B |P_1(B) - P_0(B)| ,$$

where the supremum is over all measurable sets $B$. In addition, it may be
computed as

$$\sup_{\{\phi : |\phi| \le 1\}} \left| \int \phi(x)\, dP_1(x) - \int \phi(x)\, dP_0(x) \right| ,$$

where the supremum is over all measurable functions $\phi$ such that $\sup_x |\phi(x)| \le 1$.


Problem 13.3 (i) Suppose $X$ is a random variable taking values in a sample
space $S$ with probability law $P$. Let $\omega_0$ and $\omega_1$ be disjoint families of probability
laws. Assume that, for every $Q \in \omega_1$ and any $\epsilon > 0$, there exists a subset $A$
of $S$ (which may depend on $\epsilon$) such that $Q(A) \ge 1 - \epsilon$ and such that, if $X$
has distribution $Q$, then the conditional distribution of $X$ given $X \in A$ is a
distribution in $\omega_0$; call it $P_\epsilon$. Show $\|Q - P_\epsilon\|_1 \to 0$ as $\epsilon \to 0$.

(ii) Based on data $X$ with probability law $P$, consider the problem of testing the
null hypothesis $P \in \omega_0$ versus $P \in \omega_1$. Suppose that, for every $Q \in \omega_1$, there
exists a sequence $\{P_k\}$ with $P_k \in \omega_0$ such that $\|Q - P_k\|_1 \to 0$ as $k \to \infty$. Show
that if a test $\phi$ is level $\alpha$, then $E_Q[\phi(X)] \le \alpha$ for all $Q \in \omega_1$.

(iii) Suppose $X_1,\ldots,X_n$ are i.i.d. on the real line. Let $\omega_0$ be distributions with a
finite mean and $\omega_1$ those without a finite mean. Apply (i) and (ii) to show that
no level $\alpha$ test of $\omega_0$ versus $\omega_1$ has power $> \alpha$ against any $Q \in \omega_1$.

[Such nonexistence results date back to Bahadur and Savage (1956); see Lemma
11.4.4. This example in (iii) and others are treated in Romano (2004), which also
contains many references on such problems.]


Problem 13.4 Let $P_\theta$ be uniform on $[0, \theta]$. Let $\theta_n = \theta_0 + h/n$. Calculate the
limit of $nH^2(P_{\theta_0}, P_{\theta_0 + h/n})$. If $h > 0$, let $\phi_n$ be the UMP level $\alpha$ test which rejects
when the maximum order statistic is too large. Evaluate the limit of the power
of $\phi_n$ against the alternative $\theta_n$.


Problem 13.5 Prove Lemma 13.1.1. 


Problem 13.6 Consider testing $P_{\theta_0}$ versus $P_{\theta_n}$ and assume $nH^2(P_{\theta_0}, P_{\theta_n}) \to 0$.
Let $\phi_n$ be any test sequence such that $\limsup E_{\theta_0}(\phi_n) \le \alpha$. Show that
$\limsup E_{\theta_n}(\phi_n) \le \alpha$.

Problem 13.7 Let $P_\theta$ be $N(\theta, 1)$. Fix $h$ and let $\theta_n = hn^{-1/2}$. Compute
$S(P_0, P_{\theta_n})$ and its limiting value. Compare your result with the upper bound
obtained from Theorem 13.1.3.

Problem 13.8 If $I(\theta_0)$ is a positive definite Information matrix, show $h = 0$ if
and only if $\langle h, I(\theta_0) h \rangle = 0$.

Problem 13.9 Let $X_1, \ldots, X_n$ be i.i.d. according to a model $\{P_\theta, \theta \in \Omega\}$, where
$\theta$ is real-valued. Consider testing $\theta = \theta_0$ versus $\theta = \theta_n$ at level $\alpha$ ($\alpha$ fixed,
$0 < \alpha < 1$). Show that it is possible to have $nH^2(P_{\theta_0}, P_{\theta_n}) \to c < \infty$ and still
have a sequence of level $\alpha$ tests $\phi_n = \phi_n(X_1, \ldots, X_n)$ such that $E_{\theta_n}(\phi_n) \to 1$.
Hint: Take $P_\theta$ uniform on $[0, \theta]$ and $\theta_n = \theta_0 - h/n$ for $h > 0$.

Problem 13.10 Suppose $\|P_n - Q_n\|_1 \to 0$. Show that $P_n$ and $Q_n$ are mutually
contiguous. Furthermore, show that, for any sequence of test functions $\phi_n$,
$\int \phi_n\,dP_n - \int \phi_n\,dQ_n \to 0$.

Problem 13.11 For a q.m.d. family, show $nH^2(P_{\theta_0 + hn^{-1/2}}, P_{\theta_0 + h_n n^{-1/2}}) \to 0$
whenever $h_n \to h$. Then, show $P^n_{\theta_0 + h_n n^{-1/2}}$ is contiguous to $P^n_{\theta_0}$ whenever
$h_n \to h$.

Problem 13.12 Use Problem 13.11 to show that Theorem 12.2.3 (i) remains
valid if $h$ is replaced by $h_n$ as long as $h_n$ falls in a bounded subset of $\mathbb{R}^k$. Then,
show that, for any $c > 0$, the supremum over $h$ such that $|h| \le c$ of the left side
of (12.13) tends to 0 in probability under $\theta_0$. Also, show part (ii) of Theorem
12.2.3 generalizes if $h$ in the left hand side of the convergence (12.14) is replaced
by $h_n \to h$ in $\mathbb{R}^k$.

Problem 13.13 Use Problem 13.11 to prove Theorem 12.4.1 when $h_n \to h$.

Problem 13.14 Give an example where $\|Q_n - P_n\|_1 \to \delta > 0$ but $P_n$ and $Q_n$
are mutually contiguous.

Problem 13.15 Let $P_n$ and $Q_n$ be two sequences of probability measures defined
on $(\Omega_n, \mathcal{F}_n)$. Assume they are contiguous. Assume further that both of them
are product measures, i.e.

$$P_n = \prod_{i=1}^n P_{n,i} \quad \text{and} \quad Q_n = \prod_{i=1}^n Q_{n,i} .$$

Let $\|Q - P\|_1$ denote the total variation distance between $P$ and $Q$. Show that

$$\sup_n \sum_{i=1}^n \|Q_{n,i} - P_{n,i}\|_1^2 < \infty .$$

Problem 13.16 Let $f(x)$ be the triangular density on $[-1, 1]$ defined by

$$f(x) = (1 - |x|) I\{x \in [-1, 1]\} .$$

Let $P_\theta$ be the distribution with density $f(x - \theta)$. Find the asymptotic behavior of
$H(P_{\theta_0}, P_{\theta_0 + h})$ as $h \to 0$, where $H$ is the Hellinger distance. Compare your result
with q.m.d. families.


Section 13.2 

Problem 13.17 Under the assumptions of Theorem 13.2.1, suppose $\theta_k \to \theta_0$
and $\beta > \alpha > 0$. Show, for any $N < \infty$, there does not exist a test $\phi_k$ with $k \le N$
such that $\liminf_k E_{\theta_k}(\phi_k) \ge \beta$.

Problem 13.18 Under the assumptions of Example 13.2.1, show that the
squared efficacy of the Wald test is $I(\theta_0)$.

Problem 13.19 Suppose $\Omega_0 = \{\theta_0\}$. In order to determine $c = c(n, \alpha)$ in
(13.32), define $c(n, \alpha)$ to be

$$c(n, \alpha) = \inf\{d : P_{\theta_0}\{T_n > d\} \le \alpha\} .$$

Argue that this choice of $c(n, \alpha)$ satisfies (13.32). What if $T_n > d$ is replaced by
$T_n \ge d$?

Problem 13.20 For a double exponential location family, calculate the Pitman 
AREs among pairwise comparisons of the t-test, the Wilcoxon test, and the Sign 
test. 

Problem 13.21 Prove the inequality (13.30). Hint: The quantity (13.29) is invariant
with respect to scale. By taking $\sigma^2 = 1$, the problem reduces to choosing
$f$ to minimize $\int f^2$ subject to $f$ being a mean 0 density with variance 1. Using
the method of undetermined multipliers, it is sufficient to minimize

$$\int [f^2(x) + 2b(x^2 - a^2) f(x)]\,dx ,$$

where $a$ and $b$ are chosen so that $f$ is a mean 0 density with variance 1.

Problem 13.22 Suppose $X_1, \ldots, X_n$ are i.i.d. Poisson with unknown mean $\theta$.
The problem is to test $\theta = \theta_0$ versus $\theta > \theta_0$. Consider the test that rejects for
large $\bar X_n$ and the test that rejects for large

$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2 .$$

Compute the Pitman ARE.

Problem 13.23 Suppose $X_1, \ldots, X_n$ are i.i.d. $N(0, \sigma^2)$. Let $T_{n,1} = \bar Y_n =
n^{-1} \sum_{i=1}^n Y_i$, where $Y_i = X_i^2$. Also, let $T_{n,2} = (2n)^{-1} \sum_{i=1}^n (Y_i - \bar Y_n)^2$. For testing
$\sigma = 1$ versus $\sigma > 1$, does the Pitman asymptotic relative efficiency of $T_{n,1}$ with
respect to $T_{n,2}$ exist? If so, find it.

Section 13.3 

Problem 13.24 For testing $\theta = \theta_0$ versus $\theta > \theta_0$, define two test sequences
$\phi_n$ and $\psi_n$ to be asymptotically equivalent under the null hypothesis if
$\phi_n - \psi_n \to 0$ in probability under $\theta_0$. Does this imply that, if $\theta_0$ is the true value,
the probability the tests reach the same conclusion tends to 1? Show that, under
q.m.d., asymptotic equivalence under the null hypothesis also implies that, under
an alternative sequence $\theta_{n,h} = \theta_0 + hn^{-1/2}$,

$$E_{\theta_{n,h}}(\phi_n) - E_{\theta_{n,h}}(\psi_n) \to 0 .$$

Furthermore, assume at least one of the two, say $\phi_n$, is nonrandomized. Then,
conclude the tests are asymptotically equivalent in the sense that the probability
the tests reach the same conclusion tends to 1, both under $\theta_0$ and a sequence
of alternatives $\theta_{n,h}$.

Problem 13.25 Under the q.m.d. assumptions of this section, show that $\phi_{n,h}$
given by (13.34) and $\phi_n$ given by (13.43) are asymptotically equivalent in the
sense of Problem 13.24 for testing $\theta_0$ against $\theta_0 + hn^{-1/2}$.


Problem 13.26 Let $X_1, \ldots, X_n$ be i.i.d. $N(\theta, 1)$. For testing $\theta = 0$ against $\theta > 0$,
let $\phi_n$ be the UMP level $\alpha$ test. Let $\psi_n$ be the test which rejects if $\bar X_n > b_n/n^{1/2}$
or $\bar X_n < -a_n/n^{1/2}$, where $b_n = z_{1-\alpha} + n^{-1/4}$ and $a_n$ is then determined to meet
the level constraint. Are the tests asymptotically equivalent? Show that, for all
$\theta > 0$,

$$\frac{1 - E_\theta(\psi_n)}{1 - E_\theta(\phi_n)} \to \infty \quad \text{as } n \to \infty .$$

How do you interpret this result? [Lehmann (1949)]


Problem 13.27 Prove Lemma 13.3.1 (iii). Hint: Problems 13.12-13.13. 


Problem 13.28 Prove Theorem 13.3.1. 


Problem 13.29 Prove the equivalence of Definition 13.3.2 and the definition 
in the statement immediately following Definition 13.3.2. What is an equivalent 
characterization for LAUMP tests? 

Problem 13.30 For testing $\theta_0$ versus $\theta_n$, let $\phi_n^*$ be a test satisfying

$$\limsup_n E_{\theta_0}(\phi_n^*) = \alpha^* < \alpha$$

and $E_{\theta_n}(\phi_n^*) \to \beta^*$.

(i) Show there exists a test sequence $\psi_n$ satisfying $\limsup_n E_{\theta_0}(\psi_n) = \alpha$ and a
number $\beta$ such that

$$\lim E_{\theta_n}(\psi_n) = \beta \ge \beta^* ,$$

and this last inequality is strict unless $\beta^* = 1$.

(ii) Hence, show that, under the conditions of Theorem 13.3.3, any LAUMP level
$\alpha$ test sequence $\phi_n$ satisfies $E_{\theta_0}(\phi_n) \to \alpha$.

Problem 13.31 Suppose $Z_n$ is any sequence of random variables such that
$\mathrm{Var}_{\theta_n}(Z_n) \le 1$ while $E_{\theta_n}(Z_n) \to \infty$. Here, $\theta_n$ merely indicates the distribution
of $Z_n$ at time $n$. Show that, under $\theta_n$, $Z_n \to \infty$ in probability.

Problem 13.32 In the double exponential location model of Example 13.3.2,
show that an MLE is a sample median $\hat\theta_n$. The test that rejects the null
hypothesis if $n^{1/2}\hat\theta_n > z_{1-\alpha}$ is AUMP and is asymptotically equivalent to Rao’s
score test in the sense of Problem 13.24.


Problem 13.33 For the Cauchy location model of Example 13.3.3, consider the
estimator $\hat\theta_n$ defined by (13.59). Show that the test that rejects when $n^{1/2}\hat\theta_n >
2^{1/2} z_{1-\alpha}$ is AUMP. Is the estimator location equivariant? Is the estimator $\hat\theta_n =
\hat\theta_n(X_1, \ldots, X_n)$ monotone in the sense it is nondecreasing as any one component
$X_i$ increases?

Problem 13.34 Let $X_1, \ldots, X_n$ be i.i.d. according to a q.m.d. location model
$f(x - \theta)$. Let $\hat\theta_n$ be any location equivariant estimator satisfying (13.58) (such as
an efficient likelihood estimator). For testing $\theta \le 0$ against $\theta > 0$, show that the
test that rejects when $n^{1/2}\hat\theta_n > I^{-1/2}(0) z_{1-\alpha}$ is AUMP.

Problem 13.35 Assume the conditions of Theorem 13.3.3. Assume $\phi_n$ is
LAUMP level $\alpha$. Suppose the power function of $\phi_n$ is nondecreasing in $\theta$, for
$\theta \ge \theta_0$. Show $\phi_n$ is also AUMP level $\alpha$.

Problem 13.36 Assume the conditions of Example 13.3.1. Further assume $f$ is
strongly unimodal, i.e., $-\log(f)$ is convex. Show the test $\phi_n$ given by (13.43) is
AUMP level $\alpha$. Hint: Use Problem 13.35.

Problem 13.37 Suppose $X_1, \ldots, X_n$ are i.i.d. Poisson($\lambda$). Consider testing the
null hypothesis $H_0 : \lambda = \lambda_0$ versus the alternative, $H_A : \lambda > \lambda_0$.

(i) Consider the test $\phi_n^1$ with rejection region $n^{1/2}[\bar X_n - \lambda_0] > z_{1-\alpha} \lambda_0^{1/2}$, where
$\Phi(z_\alpha) = \alpha$ and $\Phi$ is the cdf of a standard normal random variable. Find the
limiting power of this test against $\lambda_0 + hn^{-1/2}$.

(ii) Alternatively, let $g$ be a differentiable, monotone increasing function with
$g'(\lambda_0) > 0$, and consider the test $\phi_n^g$ with rejection region

$$n^{1/2}[g(\bar X_n) - g(\lambda_0)] > z_{1-\alpha}\, g'(\lambda_0) \lambda_0^{1/2} .$$

Show that $\phi_n^1$ and $\phi_n^g$ are equivalent in the sense that, for any $b > 0$,

$$\sup_{0 \le h \le b} E_{\lambda_0 + hn^{-1/2}} |\phi_n^1 - \phi_n^g| \to 0 .$$

(iii) Can we replace $b$ by $\infty$?

Problem 13.38 Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\theta, 1 + \theta^2)$. Consider testing $\theta = \theta_0$
versus $\theta > \theta_0$ and let $\phi_n$ be the test that rejects when
$n^{1/2}[\bar X_n - \theta_0] > z_{1-\alpha}(1 + \theta_0^2)^{1/2}$.

(i) Compute the limiting power of this test against $\theta_0 + hn^{-1/2}$.

(ii) Is this test AUMP?

Problem 13.39 Define appropriate extensions of the definitions of LAUMP and
AUMP to two-sided testing of a real parameter. Let $X_1, \ldots, X_n$ be i.i.d. $N(\theta, 1)$.
Show that neither LAUMP nor AUMP tests exist for testing $\theta = 0$ against $\theta \ne 0$.


Section 13.4 

Problem 13.40 Suppose $\{Q_{n,h}, h \in \mathbb{R}^k\}$ is asymptotically normal according to
Definition 13.4.1, with $Z_n$ and $C$ satisfying (13.62). Show the matrix $C$ is uniquely
determined. Moreover, if $\tilde Z_n$ is any other sequence also satisfying (13.62), then
$Z_n - \tilde Z_n \to 0$ in $Q_{n,h}$-probability for any $h$.

Problem 13.41 Suppose $\{Q_{n,h}, h \in \mathbb{R}^k\}$ is asymptotically normal. Show that
$Q_{n,h_1}$ and $Q_{n,h_2}$ are mutually contiguous for any $h_1$ and $h_2$.

Problem 13.42 Assume $\{Q_{n,h}, h \in \mathbb{R}^k\}$ is asymptotically normal according
to Definition 13.4.1, with $Z_n$ and $C$ satisfying (13.62). Show that, under $Q_{n,h}$,

$$Z_n \xrightarrow{d} N(Ch, C) .$$

Problem 13.43 Let $dN(h, C)$ denote the density of the normal distribution
with mean vector $h \in \mathbb{R}^k$ and positive definite covariance matrix $C$. Prove that
$\exp(\langle h, x \rangle - \frac{1}{2}\langle h, Ch \rangle)\,dN(0, C)(x)$ is the density of $N(Ch, C)$ evaluated at $x$.
Hint: Use characteristic functions.


Section 13.5 

Problem 13.44 In the location scale model of Example 13.5.2, verify the expressions
for the Information matrix. Deduce that the matrix is diagonal if $f$ is
an even function.

Problem 13.45 For the location scale model of Example 13.5.2 with $f(x) =
C(\beta) \exp[-|x|^\beta]$, argue that the family is q.m.d. if $\beta > 1/2$.

Problem 13.46 For the location scale model in Problem 13.45, for
testing $\mu \le 0$ versus $\mu > 0$, argue that the Wald test is LAUMP if $\beta > 1$. If $\hat\sigma_n$
is replaced by any consistent estimator of $\sigma$, does the LAUMP property continue
to hold? If $1/2 < \beta \le 1$, argue that the Rao test is LAUMP.

Problem 13.47 In Example 13.5.3, for testing $\rho \le 0$ versus $\rho > 0$, find the
optimal limiting power of the LAUMP test against alternatives $hn^{-1/2}$. Compare
with the case where the means and variances are known. Generalize to the case
of testing $\rho \le \rho_0$ against $\rho > \rho_0$.

Problem 13.48 Derive the inequality (13.74) under general conditions which 
assume the model is asymptotically normal. 

Problem 13.49 Assume (13.75) and the setup described there. Show that the
test that rejects when $g(\hat\theta_n) > z_{1-\alpha} \hat\sigma_n$ is pointwise level $\alpha$ and has a power
function such that there is equality in (13.74).

Problem 13.50 Verify (13.76) as well as the form of the matrix $C(\theta_0)$.

Problem 13.51 Assume the conditions of Theorem 13.5.1. Consider the problem
of testing $g(\theta) = 0$ against $g(\theta) \ne 0$. Restrict attention to tests $\phi_n$ that are
asymptotically unbiased in the sense

$$\liminf_n \inf_{\{\theta :\, g(\theta) \ne 0\}} E_\theta(\phi_n) \ge \alpha ,$$

as well as (13.69). Prove a result analogous to Theorem 13.5.1. Hint: See Problem
5.10.

Problem 13.52 Consider the one-sample $N(\mu, 1)$ problem for testing $|\mu| \ge \Delta$
versus $|\mu| < \Delta$. Show that the level $\alpha$ test based on combining the two one-sided
UMP level $\alpha$ tests has size strictly less than $\alpha$.

Problem 13.53 Show that the size of the TOST test considered in Example
13.5.5 is $\alpha$.

Problem 13.54 Let $C = C(\alpha, \delta, \sigma)$ be defined by (13.86). Show that
$C > \delta - \sigma z_{1-\alpha}$. Use this to show that, in Example 13.5.6, the limiting power of $\phi_n^*$ always
exceeds that of $\phi_n^{TOST}$.

Problem 13.55 As in Example 13.5.7, consider testing $\theta = \theta_0$ versus $\theta \ne \theta_0$.
Suppose $\phi_n$ is asymptotically level $\alpha$ and asymptotically unbiased in the sense

$$\liminf_n E_{\theta_0 + hn^{-1/2}}(\phi_n) \ge \alpha$$

for any $h \ne 0$. Argue that, among such tests $\phi_n$, the two-sided Rao test $\phi_{n,2}$ is
LAUMP.

Problem 13.56 Generalize Example 13.5.7 to the case of testing $\theta = \theta_0$ versus
$\theta \ne \theta_0$ in the presence of nuisance parameters.

Problem 13.57 Under the conditions of Theorem 13.5.5 used to prove an 
asymptotic maximin result for Rao’s test, derive analogous optimality results 
for both the Wald and likelihood ratio tests. 


Section 13.6 

Problem 13.58 Show that the family of densities (13.95) satisfies (11.77) for
small enough $\theta$.

Problem 13.59 Verify (13.101).

Problem 13.60 Compare the bounds (13.101) and (13.104). For what $u$ is each
attainable? Why is (13.101) generally not attainable for all $u$, even though there
exists a test for the submodel $\{P_{u,t}\}$ for which the bound is attainable?

Problem 13.61 In Example 13.6.2, argue that the given test attains the optimal
limiting power uniformly in $h$, for $0 \le h \le c$ and any $c > 0$.

Problem 13.62 In Theorem 13.6.1, compute the limiting power against
$P_{u, hn^{-1/2}}$ where $h$ is chosen so that $n^{1/2}\theta(P_{u, hn^{-1/2}}) \to \delta$. [The solution does
not depend on $u$ but only on the value of $\delta$, which was noted by Pfanzagl and
Wefelmeyer (1985).]

Problem 13.63 Provide the details for the optimality claimed in Example
13.6.3 for testing the variance in a nonparametric setting.

Problem 13.64 Let $\mathbf{P}$ be the set of all joint distributions in $\mathbb{R}^2$ on some compact
set. Let $\theta(P)$ denote the correlation functional. For testing $\theta(P) \le 0$,
construct an asymptotically optimal test in a nonparametric setting.

Problem 13.65 Consider testing the difference of two population means
$\mu(P_X) - \mu(P_Y) \le 0$ in a nonparametric setting. Generalize Theorem 13.6.1 to
obtain locally AUMP tests.





13.8 Notes 

The Hellinger distance introduced in Section 13.1 was fundamental in Kakutani 
(1948) and does not seem to have been employed by Hellinger (Le Cam and Yang 
(2000), p. 48). The use of Hellinger distance to construct estimators and tests is 
developed in Beran (1977) and Simpson (1989). 

The concept of Pitman asymptotic relative efficiency can be traced to an unpublished
set of his lecture notes (1949); Noether (1955) published a slightly
more general result. The inequality (13.30) is due to Hodges and Lehmann (1956).
Further results and references can be found in Serfling (1980) and Nikitin (1995).
Some important alternative concepts of efficiency can be found in Bahadur (1960,
1965), Kallenberg (1982, 1983), and Inglot, Kallenberg and Ledwina (2000). Some
numerical calculations are given in Groeneboom and Oosterhoff (1981). Higher
order asymptotic comparisons can be approached through the concept of deficiency,
introduced in Hodges and Lehmann (1970). Some general results for rank
and permutation tests in the one-sample problem are obtained in Albers, Bickel
and van Zwet (1976); analogous results for the two-sample problem are obtained
in Bickel and van Zwet (1978). Pitman efficiencies of multivariate spatial sign
and rank tests are considered in Peters and Randles (1991) and Möttönen, Oja
and Tienari (1997). Asymptotic efficiency of rank tests is studied in Behnen and
Neuhaus (1989) and Hájek, Šidák, and Sen (1999). Higher order efficiency is also
considered in Bening (2000).

Our approach to large sample efficiency of tests is largely due to ideas in Wald
(1939, 1941ab, 1943), though his assumptions were too strong. He focused on
MLEs and the tests now known as Wald tests. Wald basically argued that one
could construct optimal large sample tests based on the normal approximation
to the MLE. A more formal approach was later provided by Le Cam’s (1964,
1972) elegant notion of convergence of experiments, of which convergence to a
normal experiment in the sense of Definition 13.4.1 is an important special case.
This approach was used in Choi, Hall and Schick (1996). For references on (local)
asymptotically normal experiments in time series models, see Hallin et al.
(1999). Generalizations to limiting Poisson experiments and locally asymptotically
quadratic experiments are discussed in Le Cam and Yang (2000). Roussas (1972)
formulated and developed the concept of AUMP tests. The proof of Theorem
13.4.1 is based on Lemma 3.4.4 of Rieder (1994). The results in Section 13.5.2
are obtained in Romano (2005). Nonparametric tests of equivalence are studied
in Janssen (2000b); also see Wellek (2003). The reduction of a nonparametric
problem to a parametric one through the use of a least favorable family is due
to Stein (1956b), and is prominent in the work of Koshevnik and Levit (1976),
Pfanzagl (1982, 1985), Bickel et al. (1993) and Janssen (1999), among others. The
proof of Theorem 13.6.1 is based on the more general result Theorem 25.44 of van
der Vaart (1998). Efficiency of nonparametric confidence intervals is discussed
in Low (1997) and Romano and Wolf (2000).
14.1 Introduction

So far, the principal framework of this book has been optimality (either exact or 
asymptotic) in situations where both the hypothesis and the class of alternatives 
were specified by parametric models. In the present chapter, we shall take up 
the crucial problem of testing the validity of such models, the hypothesis of 
goodness of fit. For example, we would like to know whether a set of measurements 
$X_1, \ldots, X_n$ is consonant with the assumption that the $X$’s are an i.i.d. sample
from a normal distribution. 

A difficulty in testing such a hypothesis is that the class of alternatives typically 
is enormously large and can no longer be described by a parametric model. As 
a result, although some asymptotic optimality results are presented, they are 
isolated; no general asymptotic optimality theory seems to exist for this problem. 
In fact, there is growing evidence, such as the results of Janssen (2000a) (see 
Theorem 14.6.2), that any test can achieve high asymptotic power against local 
or contiguous alternatives for at most a finite-dimensional parametric family. 

Because of the importance of the problem of testing goodness of fit, we shall 
nevertheless consider this problem here. However, the focus will no longer be on 
optimality. Instead, we shall present some of the principal methods that have 
been proposed and study their relative strengths and weaknesses. 

For the sake of simplifying a very complicated problem we shall consider the 
case where $X_1, \ldots, X_n$ are i.i.d. according to some probability distribution $P$,
and shall mostly assume that the null hypothesis $P = P_0$ completely specifies
the distribution. While this assumption frequently is not fulfilled in applications, 
it makes it possible to cover some principal features of the problem which carry 
over to the more complex case of composite hypotheses. 
In the case where the observations are real-valued, we index the unknown
distribution by the underlying c.d.f. $F$ and the problem is to test $F = F_0$. We
will typically consider the case where $F_0$ is the uniform distribution on $(0,1)$.
This special case can be generalized to the problem of testing the simple null
hypothesis $H$ that $X_1, \ldots, X_n$ are i.i.d. from any fixed continuous c.d.f. $F$ on the
real line. To see how, define $Y_i = F(X_i)$, so that the $Y_i$ are i.i.d. $U(0,1)$ under
$H$ (Problem 3.22); then, test the hypothesis that $Y_1, \ldots, Y_n$ are i.i.d. uniform on
$(0,1)$.
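
As a rough illustration of this reduction to the uniform case, the following sketch
(Python, standard library only) maps a sample through a hypothesized continuous
c.d.f. The choice of the standard normal for the hypothesized c.d.f. and the helper
names `normal_cdf` and `to_uniform` are assumptions made for this example only;
they do not come from the text.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """C.d.f. of N(mean, sd^2), written with the error function (illustrative choice of F)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def to_uniform(xs, cdf):
    """Map each observation X_i to Y_i = F(X_i)."""
    return [cdf(x) for x in xs]

# Hypothetical data; under the null hypothesis the transformed values
# behave like an i.i.d. sample from U(0,1).
ys = to_uniform([-1.2, 0.0, 0.4, 2.1], normal_cdf)
```

Any test of uniformity can then be applied to `ys` in place of the original
observations.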

Let $\hat F_n$ be the empirical c.d.f., which uniformly tends to $F$ with probability
one, by the Glivenko-Cantelli theorem. For testing the simple null hypothesis
$F = F_0$, a natural starting point is to base a test statistic on some measure of
discrepancy between $\hat F_n$ and $F_0$. In particular, if $d$ is any metric on the space of
distribution functions, then $d(\hat F_n, F_0)$ could serve as a test statistic. A classical
choice is $d = d_K$, the Kolmogorov-Smirnov metric, which historically yielded the
first test of goodness of fit that is (pointwise) consistent against any alternative.
This test is studied in Section 14.2, but many other choices are possible; see
Section 14.2.2. Two such choices are the Cramér-von Mises statistic and the
Anderson-Darling statistic; in fact, these choices are often much more powerful than the
Kolmogorov-Smirnov test.

In Section 14.3, the classical Chi-squared test is studied, and its asymptotic 
properties are derived. The class of Neyman smooth tests is considered in Section 
14.4; it includes the Chi-squared test as a special case, and serves to motivate the 
class of weighted quadratic test statistics studied in Section 14.5. The difficulty 
of constructing goodness of fit tests with good power against broad alternatives 
is studied in Section 14.6. 


14.2 The Kolmogorov-Smirnov Test 

14.2.1 Simple Null Hypothesis

Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued observations with c.d.f. $F$, and consider
the problem of testing the simple null hypothesis that $F = F_0$ versus $F \ne F_0$.
The classical Kolmogorov-Smirnov goodness of fit test statistic, introduced in
Section 6.13 and Example 11.2.12, is

$$T_n = \sup_{t \in \mathbb{R}} n^{1/2} |\hat F_n(t) - F_0(t)| = n^{1/2} d_K(\hat F_n, F_0) , \qquad (14.1)$$

where $d_K$ is the Kolmogorov-Smirnov distance

$$d_K(F, G) = \sup_t |F(t) - G(t)| .$$

Note that $d_K(F, G) = 0$ if and only if $F = G$.
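
For a continuous hypothesized c.d.f., the supremum in (14.1) can only be attained
at a jump of the empirical c.d.f., so the statistic reduces to a finite maximum over
the order statistics. The following is a minimal sketch of that computation in
Python; the function names `ks_statistic` and `uniform_cdf` and the three data
points are hypothetical and serve only as an illustration.

```python
import math

def ks_statistic(xs, cdf0):
    """Return T_n = n^{1/2} sup_t |F_n(t) - F_0(t)|, assuming cdf0 is continuous."""
    n = len(xs)
    u = sorted(cdf0(x) for x in xs)
    # Just after the i-th order statistic (1-based) the empirical c.d.f. equals i/n,
    # and just before it equals (i-1)/n; compare both with F_0 at that point.
    d_plus = max((i + 1) / n - u[i] for i in range(n))
    d_minus = max(u[i] - i / n for i in range(n))
    return math.sqrt(n) * max(d_plus, d_minus)

def uniform_cdf(x):
    """C.d.f. of the uniform distribution on (0,1)."""
    return min(1.0, max(0.0, x))

t_n = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)  # equals 0.3 * sqrt(3) up to rounding
```

For these three points the largest discrepancy, 0.3, occurs just after the largest
observation, where the empirical c.d.f. jumps to 1 while the hypothesized c.d.f.
is 0.7.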

The distribution of $T_n$ under $F$ is the same for all continuous $F$ (Problem
11.57). Let $s_{n,1-\alpha}$ be the $1 - \alpha$ quantile of the distribution of $T_n$ under any
continuous $F$. The Kolmogorov-Smirnov test rejects the null hypothesis if
$T_n > s_{n,1-\alpha}$. If $F_0$ is not continuous, using $s_{n,1-\alpha}$ results in a test that has level less
than $\alpha$ (Problem 11.58), but in principle, one can determine (or simulate) a
critical value that yields an exact level $\alpha$ test for this situation. Much of the
remaining discussion in the section will focus on the case where the critical value
$s_{n,1-\alpha}$ is used (but the arguments apply more generally). For references to tables
of critical values and finite sample power calculations, see the references given in
Example 11.2.12.

In order to study the limiting behavior of $T_n$, introduce the function

$$B_n(t) = n^{1/2}[\hat F_n(t) - F_0(t)] . \qquad (14.2)$$

For each $t$, $B_n(t)$ is a real-valued random variable; in addition, $B_n(\cdot)$ can be
viewed as a random function (or process) on $[0,1]$, called the empirical process.
By the multivariate Central Limit Theorem, if the null hypothesis is true, then
for any $t_1, \ldots, t_k$,

$$[B_n(t_1), \ldots, B_n(t_k)] \xrightarrow{d} [B(t_1), \ldots, B(t_k)] , \qquad (14.3)$$

where $[B(t_1), \ldots, B(t_k)]$ has the multivariate normal distribution with mean 0
and covariance matrix $\Sigma$, whose $(i,j)$th entry $\sigma_{i,j}$ is given by

$$\sigma_{i,j} = \begin{cases} F_0(t_i)(1 - F_0(t_i)) & \text{if } i = j \\ F_0(\min(t_i, t_j)) - F_0(t_i) F_0(t_j) & \text{otherwise.} \end{cases} \qquad (14.4)$$
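
To make the covariance structure (14.4) concrete, the sketch below builds the
matrix for a given set of points. It relies on the fact that a c.d.f. is nondecreasing,
so that $F_0(\min(t_i, t_j)) = \min(F_0(t_i), F_0(t_j))$, and on the observation that the
off-diagonal formula reduces to the diagonal one when $i = j$. The function name
`bridge_covariance` and the uniform null are assumptions chosen for illustration.

```python
def bridge_covariance(ts, cdf0):
    """Covariance matrix of [B(t_1), ..., B(t_k)] following (14.4)."""
    p = [cdf0(t) for t in ts]
    k = len(ts)
    # min(p_i, p_j) - p_i * p_j covers both cases of (14.4):
    # on the diagonal it equals p_i * (1 - p_i).
    return [[min(p[i], p[j]) - p[i] * p[j] for j in range(k)] for i in range(k)]

def uniform_cdf(x):
    """C.d.f. of the uniform distribution on (0,1)."""
    return min(1.0, max(0.0, x))

sigma = bridge_covariance([0.25, 0.5], uniform_cdf)
# sigma == [[0.1875, 0.125], [0.125, 0.25]]
```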

By the Continuous Mapping Theorem, it follows that, for any $t_1, \ldots, t_k$,

$$\max_i n^{1/2} |\hat F_n(t_i) - F_0(t_i)| \xrightarrow{d} \max_i |B(t_i)| . \qquad (14.5)$$

In fact, $B(\cdot)$ itself can be represented as a random continuous process on $[0,1]$,
called the Brownian Bridge process. The study of random functions and empirical
processes is beyond the scope of this book, but it is developed in Pollard (1984)
and van der Vaart and Wellner (1996). However, the result (14.5) provides both
insight and a basis for a rigorous treatment of the limiting behavior of $T_n$, which
is the supremum over all $t$, and not just a finite set, of $|B_n(t)|$. It turns out that $T_n$
has a limiting distribution which is continuous and strictly increasing on $(0, \infty)$.
More specifically, Kolmogorov (1933) showed that if $F_0$ is continuous, then for
any $d > 0$,

$$P\{T_n > d\} \to 2 \sum_{k=1}^{\infty} (-1)^{k+1} \exp(-2k^2 d^2) .$$

The $1 - \alpha$ quantile of this distribution will be denoted by $s_{1-\alpha}$.
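
The limiting tail probability is an alternating series whose terms decay very fast,
so both it and the quantile $s_{1-\alpha}$ are easy to evaluate numerically. The sketch below
truncates the series and finds the quantile by bisection; the function names, the
truncation at 100 terms and the bisection bracket $[0.2, 5]$ are choices made for
this example, not part of the text.

```python
import math

def kolmogorov_tail(d, terms=100):
    """Truncated series 2 * sum_{k>=1} (-1)^(k+1) exp(-2 k^2 d^2), for d > 0."""
    return 2.0 * sum((-1) ** (k + 1) * math.exp(-2.0 * k * k * d * d)
                     for k in range(1, terms + 1))

def kolmogorov_quantile(alpha, lo=0.2, hi=5.0, tol=1e-10):
    """Solve kolmogorov_tail(s) = alpha by bisection (the tail decreases in s)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_tail(mid) > alpha:
            lo = mid  # tail still too large: the quantile lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

s_95 = kolmogorov_quantile(0.05)  # approximately 1.358
```

The bracket is adequate for conventional levels such as 0.10, 0.05 or 0.01; for very
small or very large $\alpha$ it would need to be widened.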

We now discuss some power properties of the Kolmogorov-Smirnov test. 

Theorem 14.2.1 The Kolmogorov-Smirnov test is pointwise consistent in power
against any fixed $F \ne F_0$; that is,

$$P_F\{T_n > s_{n,1-\alpha}\} \to 1$$

as $n \to \infty$.

Proof. By the Glivenko-Cantelli theorem, under an alternative $F$,

$$\sup_t |\hat F_n(t) - F_0(t)| \to d_K(F, F_0) > 0$$

almost surely, and so $T_n \to \infty$ almost surely. Hence, by Slutsky’s theorem,

$$P_F\{T_n > s_{n,1-\alpha}\} \to 1 ,$$

since $s_{n,1-\alpha} \to s_{1-\alpha} < \infty$. ■

For an alternative instructive proof of consistency (due to Massey (1950)), fix
any $F$ with $d_K(F, F_0) > 0$. Then, there exists some $t$ with $F(t) \ne F_0(t)$. First,
assume $F(t) > F_0(t)$. Then,

$$P_F\{T_n > s_{n,1-\alpha}\} \ge P_F\{|n^{1/2}[\hat F_n(t) - F_0(t)]| > s_{n,1-\alpha}\}$$

$$\ge P_F\{n^{1/2}[\hat F_n(t) - F(t)] > s_{n,1-\alpha} - n^{1/2}[F(t) - F_0(t)]\} , \qquad (14.6)$$

which tends to 1 as $n \to \infty$ since the left side in the probability expression is
bounded in probability while the right side tends to $-\infty$. Hence, the limiting
power is 1 against any $F$ if there exists a $t$ with $F(t) > F_0(t)$. By similar reasoning,
the limiting power is 1 against $F$ with $F(t) < F_0(t)$ for some $t$, and hence for any
$F \ne F_0$.

We now show that the Kolmogorov-Smirnov test is uniformly consistent in
power against alternatives $F$ satisfying $n^{1/2} d_K(F, F_0) \ge \Delta_n$, as long as $\Delta_n \to \infty$.

Theorem 14.2.2 Let $X_1, \ldots, X_n$ be i.i.d. random variables with c.d.f. $F$. For
testing $F = F_0$ against $F \ne F_0$, the power of the Kolmogorov-Smirnov test tends
to one uniformly over all alternatives $F$ satisfying $n^{1/2} d_K(F, F_0) \ge \Delta_n$ if
$\Delta_n \to \infty$ as $n \to \infty$; that is,

$$\inf\{P_F\{T_n > s_{n,1-\alpha}\} : n^{1/2} d_K(F, F_0) \ge \Delta_n\} \to 1$$

if $\Delta_n \to \infty$.

Proof. Let $F_n$ be any sequence satisfying $n^{1/2} d_K(F_n, F_0) \ge \Delta_n$. By the triangle
inequality,

$$d_K(F_n, F_0) \le d_K(\hat F_n, F_n) + d_K(\hat F_n, F_0) ,$$

which implies

$$T_n \ge \Delta_n - n^{1/2} d_K(\hat F_n, F_n) .$$

Therefore,

$$P_{F_n}\{T_n > s_{n,1-\alpha}\} \ge P_{F_n}\{n^{1/2} d_K(\hat F_n, F_n) < \Delta_n - s_{n,1-\alpha}\} . \qquad (14.7)$$

But, by Problem 11.60, under $F_n$, $n^{1/2} d_K(\hat F_n, F_n)$ is tight. Since $\Delta_n \to \infty$ and
$s_{n,1-\alpha}$ has a finite limit, it follows that $\Delta_n - s_{n,1-\alpha} \to \infty$ and therefore

$$P_{F_n}\{T_n > s_{n,1-\alpha}\} \to 1 . \quad ■$$

One can also obtain nonasymptotic lower bounds to the power of the
Kolmogorov-Smirnov test by using (14.7). For example, application of the
Dvoretzky-Kiefer-Wolfowitz inequality (Theorem 11.2.18) yields

$$P_{F_n}\{T_n > s_{n,1-\alpha}\} \ge 1 - 2\exp[-2(\Delta_n - s_{n,1-\alpha})^2] , \qquad (14.8)$$

if $n^{1/2} d_K(F_n, F_0) \ge \Delta_n$ and $\Delta_n > s_{n,1-\alpha}$ (Problem 14.2).
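
To get a feel for the size of the bound (14.8), the sketch below evaluates its right
side for given values of the separation and of the critical value. The function name
`ks_power_lower_bound` is hypothetical, and the critical value 1.358 used in the
example is the approximate limiting 0.95 quantile, substituted for the finite-sample
value purely for illustration.

```python
import math

def ks_power_lower_bound(delta_n, s):
    """Right side of (14.8); returns the trivial bound 0 when delta_n <= s."""
    if delta_n <= s:
        return 0.0  # (14.8) is only asserted when delta_n exceeds the critical value
    return max(0.0, 1.0 - 2.0 * math.exp(-2.0 * (delta_n - s) ** 2))

# Illustrative values: separation 3 on the n^{1/2} scale, critical value 1.358.
bound = ks_power_lower_bound(3.0, 1.358)  # roughly 0.99
```

The bound is informative only when the separation exceeds the critical value by a
fair margin; for a separation just above the critical value the expression is
negative and is truncated at 0 here.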

It follows from Theorem 14.2.2 that the Kolmogorov-Smirnov test is uniformly
consistent in power against alternatives $F$ such that $d_K(F, F_0) \ge \Delta$, for any
fixed $\Delta > 0$. Note, however, that for any fixed $n$ and $\Delta$, the rejection probability
may be less than $\alpha$; that is, the Kolmogorov-Smirnov test is biased, as shown by
Massey (1950).

It also follows from Theorem 14.2.2 that the limiting power of the Kolmogorov-Smirnov
test against a sequence of alternatives $F_n$ is arbitrarily close to one for
sequences $F_n$ tending to $F_0$ sufficiently slowly. In the opposite direction, by the
triangle inequality,

$$P_F\{T_n > s_{n,1-\alpha}\} \le P_F\{n^{1/2} d_K(\hat F_n, F) + n^{1/2} d_K(F, F_0) > s_{n,1-\alpha}\} , \qquad (14.9)$$

which implies the power of the Kolmogorov-Smirnov test is poor against sequences
of alternatives tending to $F_0$ sufficiently fast (Problem 14.4). More
specifically, the following holds.

Theorem 14.2.3 For testing $F = F_0$ at level $\alpha$, the limiting power of the
Kolmogorov-Smirnov test is no better than $\alpha$ against any sequence of alternatives
$F_n$ satisfying $n^{1/2} d_K(F_n, F_0) \to 0$; that is,

$$\limsup_n P_{F_n}\{T_n > s_{n,1-\alpha}\} \le \alpha .$$

Thus, the Kolmogorov-Smirnov test cannot distinguish sequences that are at a
distance $o(n^{-1/2})$ from $F_0$, where distance refers to the metric $d_K$. In fact, no test
can have good power against all sequences $F_n$ satisfying $n^{1/2} d_K(F_n, F_0) \to 0$. To
prove this statement, consider a smooth parametric model containing $F_0$, such
as a one-parameter exponential family having density of the form

$$\exp(\theta T(x) - A(\theta))\,dF_0(x) .$$

Let $F_n$ denote the c.d.f. corresponding to this density with $\theta = h_n n^{-1/2}$. Note
that $d_K(F_n, F_0) = O(h_n n^{-1/2})$ (Problem 14.5). Then, the AMP test sequence
for testing $\theta = 0$ (corresponding to $F_0$) against $\theta_n = h_n n^{-1/2}$ has limiting power
$\alpha$ if $h_n \to 0$.

One can also obtain an upper bound to the power against alternatives $F_n$
satisfying

$$n^{1/2} d_K(F_n, F_0) \to \delta < s_{1-\alpha} .$$

By (14.9),

$$P_F\{T_n > s_{n,1-\alpha}\} \le P_F\{d_K(\hat F_n, F) > n^{-1/2} s_{n,1-\alpha} - d_K(F, F_0)\} .$$

Then, by the Dvoretzky, Kiefer and Wolfowitz Inequality (Theorem 11.2.18), the
last expression is bounded above by

$$2\exp\{-2n[s_{n,1-\alpha} n^{-1/2} - d_K(F, F_0)]^2\} .$$

Therefore, if $F_n$ is a sequence satisfying

$$n^{1/2} d_K(F_n, F_0) \to \delta < s_{1-\alpha} ,$$

then the limiting power against $F_n$ is bounded above by

$$2\exp[-2(s_{1-\alpha} - \delta)^2] .$$
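
A short numerical sketch shows how crude this limiting upper bound can be. The
function name `ks_power_upper_bound` is hypothetical, the value 1.358 is the
approximate limiting 0.95 quantile (level 0.05), and the truncation at 1 is added
here because the raw expression exceeds 1 when the limiting separation is close to
the critical value.

```python
import math

def ks_power_upper_bound(delta, s):
    """2 exp[-2 (s - delta)^2], truncated at 1, for 0 <= delta < s."""
    if not 0.0 <= delta < s:
        raise ValueError("the bound is stated only for 0 <= delta < s")
    return min(1.0, 2.0 * math.exp(-2.0 * (s - delta) ** 2))

upper_far = ks_power_upper_bound(0.2, 1.358)   # roughly 0.137
upper_near = ks_power_upper_bound(1.3, 1.358)  # raw value exceeds 1, so this is 1.0
```

Even at a limiting separation of 0 the bound is about 0.05, the nominal level in
this example, so the bound is informative mainly for small separations.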

So far, we have obtained crude upper and lower bounds to the power of 
the Kolmogorov-Smirnov test, and it follows from Theorems 14.2.2 and 14.2.3 
that, like the parametric situations considered earlier, it is against sequences of
alternatives $F_n$ with

$$n^{1/2} d_K(F_n, F_0) \to \delta \quad (0 < \delta < \infty)$$

that we expect the power of the test to tend to limits strictly between $\alpha$ and 1.
Let us now sketch an approach to calculating the exact limiting power against a
local sequence of alternatives $F_n$. Consider the normalized difference

$$d_n(t) = n^{1/2}[F_n(t) - F_0(t)] ,$$

and assume that for some function $d$

$$\sup_t |d_n(t) - d(t)| \to 0 .$$

Note the basic identity

$$n^{1/2}[\hat F_n(t) - F_0(t)] = n^{1/2}[\hat F_n(t) - F_n(t)] + d_n(t) . \qquad (14.10)$$

Under $F_n$, $n^{1/2}[\hat F_n(t) - F_n(t)]$ has mean 0 and variance

$$F_n(t)[1 - F_n(t)] \to F_0(t)[1 - F_0(t)] .$$

For fixed $t$, the Lindeberg Central Limit Theorem (see Problem 11.13) implies
that, under $F_n$,

$$n^{1/2}[\hat F_n(t) - F_n(t)] \xrightarrow{d} B(t) ,$$

where $B(t)$ has the same limiting normal distribution $N(0, F_0(t)[1 - F_0(t)])$ that
arose when studying the limiting behavior (14.3) of the empirical process $B_n(t)$
(defined in (14.2)) under $F_0$. Hence, under $F_n$, (14.10) implies that

$$n^{1/2}[\hat F_n(t) - F_0(t)] \xrightarrow{d} B(t) + d(t) \sim N(d(t), F_0(t)[1 - F_0(t)]) .$$

Similarly, for any fixed $t_1, \ldots, t_k$, under $F_n$,

$$n^{1/2}[\hat F_n(t_1) - F_0(t_1), \ldots, \hat F_n(t_k) - F_0(t_k)] \xrightarrow{d} [B(t_1) + d(t_1), \ldots, B(t_k) + d(t_k)] .$$

By the Continuous Mapping Theorem, it then follows that, under $F_n$,

$$\max_i n^{1/2} |\hat F_n(t_i) - F_0(t_i)| \xrightarrow{d} \max_i |B(t_i) + d(t_i)| . \qquad (14.11)$$

This result suggests that, under $F_n$,

$$\sup_t n^{1/2} |\hat F_n(t) - F_0(t)| \xrightarrow{d} \sup_t |B(t) + d(t)| ,$$

where $B(t)$ is the Brownian Bridge process which was introduced at the beginning
of this section. This suggested result does in fact hold, and so the limiting power
of the Kolmogorov-Smirnov test against $F_n$ can be expressed as

$$P\{\sup_t |B(t) + d(t)| > s_{1-\alpha}\} . \qquad (14.12)$$

The evaluation of this expression involves so-called general boundary-crossing
probabilities and is beyond the present treatment; see Siegmund (1986) and the
references given in Shorack and Wellner (1986), Section 4.2. Approximations to
this limiting power are also obtained in Hájek, Šidák and Sen (1999), Section 7.4.

The results in this section show that the limiting power of the Kolmogorov-
Smirnov test against alternatives $F_n$ satisfying $n^{1/2} d_K(F_n, F_0) \to \delta$ is $\alpha$ or 1
unless $\delta$ is finite and positive. Moreover, the result (14.12) can be used to show
that typically, the limiting power is strictly between $\alpha$ and 1. Surprisingly, and
in distinction to the typical parametric situation, the limiting power can be $\alpha$
or 1 against a sequence of alternatives $F_n$ satisfying $n^{1/2} d_K(F_n, F_0) \to \delta$ even if
$0 < \delta < \infty$; for a construction, see Problem 14.6.


14.2.2 Extensions of the Kolmogorov-Smirnov Test

The basis of the Kolmogorov-Smirnov test is a measure of discrepancy between
the hypothesized distribution function $F_0$ and the empirical (cumulative)
distribution function $\hat F_n$. Any such statistic is called an EDF statistic. In particular,
if $d$ is a metric on the space of distribution functions, any statistic of the form
$d(\hat F_n, F_0)$ is an EDF statistic, with the choice $d = d_K$ corresponding to the
Kolmogorov-Smirnov statistic.

A second class of EDF statistics is given by the Cramér-von Mises family of
statistics

$$ \int_{-\infty}^{\infty} [\hat F_n(x) - F_0(x)]^2 \psi(x)\, dF_0(x) . $$

Taking $\psi(x) = 1$ yields the Cramér-von Mises statistic, while

$$ \psi(x) = \{F_0(x)[1 - F_0(x)]\}^{-1} $$

yields the Anderson-Darling statistic. Both choices will be studied in Section 14.5.

Tests based on EDF statistics can be used to test composite null hypotheses.
For example, suppose it is desired to test whether the underlying c.d.f. is $F_\theta$ for
some $\theta$ lying in a parameter space $\Theta_0$, and that $\hat\theta_n$ is some reasonable estimator
of $\theta$. Then, an EDF test statistic is defined by some measure of discrepancy
between $\hat F_n$ and $F_{\hat\theta_n}$. For example, for testing normality with unspecified mean
$\mu$ and variance $\sigma^2$, a Kolmogorov-Smirnov test statistic is given by

$$ \sup_x \left| \hat F_n(x) - \Phi\left( \frac{x - \bar X_n}{\hat\sigma_n} \right) \right| , \tag{14.13} $$

where $\Phi(\cdot)$ is the standard normal c.d.f. and $(\bar X_n, \hat\sigma_n)$ is the MLE for $(\mu, \sigma)$
under the normal model. It is easy to see that, under the null hypothesis, the
distribution of (14.13) does not depend on $(\mu, \sigma)$ (Problem 14.9), and critical
values can be approximated by simulation. Many other tests have been proposed
to test for normality; see D'Agostino and Stephens (1986).
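Since the null distribution of (14.13) is free of $(\mu, \sigma)$, a critical value can be simulated from standard normal samples alone. The sketch below is an illustration under stated assumptions rather than a prescribed procedure: the function names, the number of replications, and the seed are all hypothetical choices, and $\hat\sigma_n$ is taken to be the MLE (divisor $n$).

```python
import math
import random


def std_normal_cdf(z):
    """Standard normal c.d.f. expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ks_normal_estimated(xs):
    """Statistic (14.13): sup_x |F_n(x) - Phi((x - mean) / sigma_hat)|."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)  # MLE divides by n
    stat = 0.0
    for i, x in enumerate(sorted(xs), start=1):
        u = std_normal_cdf((x - mean) / sigma)
        # The supremum is attained at a jump of the empirical c.d.f.
        stat = max(stat, i / n - u, u - (i - 1) / n)
    return stat


def simulated_critical_value(n, alpha, reps=2000, seed=0):
    """Approximate 1 - alpha quantile of (14.13) under normality, by simulation."""
    rng = random.Random(seed)
    sims = sorted(
        ks_normal_estimated([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    )
    index = min(reps - 1, math.ceil((1.0 - alpha) * reps) - 1)
    return sims[index]
```

Applying `ks_normal_estimated` to a sample and to any affine transformation of it with positive slope gives the same value up to rounding, which is the invariance behind Problem 14.9.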

Unfortunately, for testing general parametric submodels indexed by $\theta$, the
asymptotic null distribution of an EDF statistic with estimated parameters
depends on $\theta$, which limits their use. For discussion and references to the literature
of this problem, see D'Agostino and Stephens (1986) and De Wet and Randles
(1987). An alternative approach based on the bootstrap is given in Beran (1986)
and Romano (1988); see Example 15.6.5.

EDF tests can be extended to the case where the observations are not
real-valued. Suppose $X_1, \ldots, X_n$ are i.i.d. $P$ (on some arbitrary space). The natural
extension of the empirical c.d.f. is the empirical measure, defined by

$$ \hat P_n(E) = \frac{1}{n} \sum_{i=1}^n I\{X_i \in E\} . $$

Then, EDF test statistics can be constructed by some measure of discrepancy
between $\hat P_n$ and a hypothesized $P_0$ (or $P_{\hat\theta_n}$ in the composite null hypothesis
case). See Shorack and Wellner (1986), who also discuss the two-sample problem
of comparing two samples by a measure of discrepancy between the empirical
c.d.f.s of the samples.


14.3 Pearson’s Chi-squared Statistic 

14.3.1 Simple Null Hypothesis

In this section, we return to the simple goodness of fit problem for categorical
data that was briefly considered in Example 12.4.6. As before, we are dealing
with a sequence of $n$ independent trials, each resulting in one of $k + 1$ possible
outcomes named $1, \ldots, k + 1$. The $j$th outcome occurs with probability $p_j$ on
any given trial, so that $\sum_{j=1}^{k+1} p_j = 1$. Let $Y_j$ be the number of trials resulting in
outcome $j$. The joint distribution of $(Y_1, \ldots, Y_{k+1})$ is the multinomial distribution

$$ P\{Y_1 = y_1, \ldots, Y_{k+1} = y_{k+1}\} = \frac{n!}{y_1! \cdots y_{k+1}!}\, p_1^{y_1} \cdots p_{k+1}^{y_{k+1}} , \tag{14.14} $$

with $\sum_{j=1}^{k+1} y_j = n$. The parameter space is

$$ \Omega = \{(p_1, \ldots, p_k) \in \mathbb{R}^k : p_i \ge 0, \ \sum_{j=1}^k p_j \le 1\} \tag{14.15} $$

since $p_{k+1} = 1 - \sum_{j=1}^k p_j$.

Consider testing the simple null hypothesis $p_j = \pi_j$ for $j = 1, \ldots, k + 1$ against
the alternatives $p_j \ne \pi_j$ for some $j$. It will be assumed that $(\pi_1, \ldots, \pi_k)$ is an
interior point of $\Omega$.

A standard test, proposed by Pearson (1900), rejects for large values of
Pearson's Chi-squared statistic, given by

$$ Q_n = \sum_{j=1}^{k+1} \frac{(Y_j - n\pi_j)^2}{n\pi_j} . \tag{14.16} $$


This test was already introduced in Example 12.4.6 as an approximation to the
likelihood ratio test, and it was shown that the limiting null distribution of $Q_n$
as $n \to \infty$ is the Chi-squared distribution with $k$ degrees of freedom. Below,
we will give a direct argument for this result in Theorem 14.3.1. Thus, if $c_{k,1-\alpha}$
is the $1 - \alpha$ quantile of $\chi^2_k$, then the test that rejects when $Q_n > c_{k,1-\alpha}$ is
asymptotically level $\alpha$. The accuracy of the Chi-squared approximation to the
exact null distribution of the test statistic is discussed, for example, by Radlow
and Alf (1975); for more accurate approximations in this and related problems,
see McCullagh (1985, 1986) and the literature cited there.
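As a small illustration of (14.16), the following sketch computes $Q_n$ from observed counts and hypothesized probabilities. The function name and the example counts are hypothetical; for $k = 2$ the critical value has the closed form $c_{2,1-\alpha} = -2\log\alpha$, because $\chi^2_2$ is an exponential distribution with mean 2.

```python
import math


def pearson_chi_squared(counts, probs):
    """Pearson's statistic (14.16): sum over cells of (Y_j - n pi_j)^2 / (n pi_j)."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-12:
        raise ValueError("hypothesized probabilities must sum to one")
    n = sum(counts)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(counts, probs))


# Hypothetical data: n = 60 trials, k + 1 = 3 equally likely outcomes under H.
q_n = pearson_chi_squared([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])
critical_value = -2.0 * math.log(0.05)  # 0.95 quantile of chi-squared with 2 d.f.
reject = q_n > critical_value
```

Here the expected count is 20 in every cell, so $Q_n = (4 + 4 + 0)/20 = 0.4$, far below the critical value, and the hypothesis is not rejected.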

Consider next a fixed alternative

$$ (p_1, \ldots, p_{k+1}) \ne (\pi_1, \ldots, \pi_{k+1}) . $$

If, for some $j$, $p_j \ne \pi_j$, then

$$ Q_n \ge n \left( \frac{Y_j}{n} - \pi_j \right)^2 \stackrel{P}{\to} \infty $$

since $Y_j/n \stackrel{P}{\to} p_j$, by the law of large numbers. Hence, the power against such an
alternative tends to one.

As in Example 11.2.5, a more discriminating result is obtained by considering
local alternatives $p^{(n)}$ of the form

$$ p_j^{(n)} = \pi_j + n^{-1/2} h_j , $$

where $\sum_{j=1}^{k+1} h_j = 0$. We shall now show that, against such an alternative
sequence, the limiting power is nondegenerate.


Theorem 14.3.1 Assume the above multinomial setup.

(i) Under the null hypothesis $H$: $p_j = \pi_j$ for $j = 1, \ldots, k + 1$, $Q_n \stackrel{d}{\to} \chi^2_k$, the
Chi-squared distribution with $k$ degrees of freedom.

(ii) Under the alternative hypothesis (sequence) $K$: $p_j^{(n)} = \pi_j + n^{-1/2} h_j$ where
$\sum_{j=1}^{k+1} h_j = 0$, $Q_n \stackrel{d}{\to} \chi^2_k(\lambda)$, the noncentral Chi-squared distribution with $k$ degrees
of freedom and noncentrality parameter

$$ \lambda = \sum_{j=1}^{k+1} \frac{h_j^2}{\pi_j} . \tag{14.17} $$

(iii) The power of the $\chi^2$ test based on $Q_n$ against the alternatives in (ii) with
not all the $h_j$ equal to 0 tends to a limit strictly greater than $\alpha$ and less than 1.
This holds if the test is carried out using an exact level $\alpha$ critical value, or any
critical value sequence tending to $c_{k,1-\alpha}$ in probability (such as $c_{k,1-\alpha}$ itself).


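Proof. The proof of (i) is an application of the multivariate CLT followed by
the continuous mapping theorem. Let $V_n$ be the $k \times 1$ vector defined by

$$ V_n^T = n^{1/2} \left( \frac{Y_1}{n} - \pi_1, \ldots, \frac{Y_k}{n} - \pi_k \right) . \tag{14.18} $$

By the multivariate CLT, $V_n \stackrel{d}{\to} N(0, \Sigma)$, where the $k \times k$ covariance matrix $\Sigma$
has $(i, j)$ entry (Problem 14.12 (i))

$$ \sigma_{i,j} = \begin{cases} \pi_i(1 - \pi_i) & \text{if } j = i \\ -\pi_i \pi_j & \text{otherwise.} \end{cases} \tag{14.19} $$

It can be checked that $\Sigma$ has inverse $\Sigma^{-1} = A$, where $A$ has $(i, j)$ entry given by
(Problem 14.12 (ii))

$$ a_{i,j} = \begin{cases} \dfrac{1}{\pi_i} + \dfrac{1}{\pi_{k+1}} & \text{if } j = i \\ \dfrac{1}{\pi_{k+1}} & \text{otherwise.} \end{cases} \tag{14.20} $$

Hence, $A^{1/2} V_n \stackrel{d}{\to} N(0, I_k)$, where $I_k$ is the $k \times k$ identity matrix. By the
Continuous Mapping Theorem 11.2.13,

$$ (A^{1/2} V_n)^T (A^{1/2} V_n) \stackrel{d}{\to} \chi^2_k . $$

But, the left hand side is $V_n^T A V_n$, which in turn is equal to

$$ n \sum_{j=1}^{k} \frac{1}{\pi_j} \left( \frac{Y_j}{n} - \pi_j \right)^2 + \frac{n}{\pi_{k+1}} \sum_{i=1}^{k} \sum_{j=1}^{k} \left( \frac{Y_i}{n} - \pi_i \right) \left( \frac{Y_j}{n} - \pi_j \right) . $$

The last term reduces to

$$ n \left[ \sum_{j=1}^{k} \left( \frac{Y_j}{n} - \pi_j \right) \right]^2 \Big/ \pi_{k+1} = n \left( \frac{Y_{k+1}}{n} - \pi_{k+1} \right)^2 \Big/ \pi_{k+1} , $$

where, in the last equality, we have used $\sum_{j=1}^{k} Y_j = n - Y_{k+1}$ and $\sum_{j=1}^{k} \pi_j = 1 - \pi_{k+1}$.
Thus, $V_n^T A V_n = Q_n$.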
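The two algebraic facts used in this part of the proof, namely that the matrix $A$ of (14.20) inverts the covariance matrix $\Sigma$ of (14.19) and that $V_n^T A V_n = Q_n$, can be checked numerically. The following sketch does so in plain Python; the function names and the example counts are hypothetical and serve only as a sanity check.

```python
import math


def covariance_and_inverse(probs):
    """Return (Sigma, A) from (14.19) and (14.20) for cell probabilities pi_1..pi_{k+1}."""
    k = len(probs) - 1
    last = probs[k]
    sigma = [[probs[i] * (1.0 - probs[i]) if i == j else -probs[i] * probs[j]
              for j in range(k)] for i in range(k)]
    a = [[(1.0 / probs[i] if i == j else 0.0) + 1.0 / last
          for j in range(k)] for i in range(k)]
    return sigma, a


def max_deviation_from_identity(sigma, a):
    """Largest absolute entry of Sigma A - I."""
    k = len(sigma)
    worst = 0.0
    for i in range(k):
        for j in range(k):
            entry = sum(sigma[i][m] * a[m][j] for m in range(k))
            worst = max(worst, abs(entry - (1.0 if i == j else 0.0)))
    return worst


def quadratic_form_and_pearson(counts, probs):
    """Return (V_n^T A V_n, Q_n); the proof shows these agree."""
    n = sum(counts)
    k = len(counts) - 1
    _, a = covariance_and_inverse(probs)
    v = [math.sqrt(n) * (counts[j] / n - probs[j]) for j in range(k)]
    quad = sum(v[i] * a[i][j] * v[j] for i in range(k) for j in range(k))
    q_n = sum((y - n * p) ** 2 / (n * p) for y, p in zip(counts, probs))
    return quad, q_n
```

With, say, counts `[12, 30, 18, 40]` and probabilities `[0.1, 0.3, 0.2, 0.4]`, the two returned values agree up to rounding error, even though the quadratic form only involves the first $k$ cells.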

The proof of (ii) is similar. First, note that

$$ V_n^T = n^{1/2} \left( \frac{Y_1}{n} - p_1^{(n)}, \ldots, \frac{Y_k}{n} - p_k^{(n)} \right) + (h_1, \ldots, h_k) . $$

It follows from the Cramér-Wold device and the Berry-Esseen Theorem (Problem
14.13) that, under the alternative sequence,

$$ V_n \stackrel{d}{\to} N(h, \Sigma) . \tag{14.21} $$

Therefore,

$$ A^{1/2} V_n \stackrel{d}{\to} N(A^{1/2} h, I_k) $$

and so

$$ (A^{1/2} V_n)^T (A^{1/2} V_n) \stackrel{d}{\to} \chi^2_k(\lambda) , $$

where

$$ \lambda = (A^{1/2} h)^T (A^{1/2} h) = h^T A h ; $$

simple algebra shows that $h^T A h$ agrees with the expression (14.17) for $\lambda$, and the
proof of (ii) follows.

The proof of (iii) is left as an exercise (Problem 14.15). ■


We are now in a position to prove an optimality result for Pearson's Chi-
squared test in the multinomial goodness of fit problem. The problem is to test
the null hypothesis $p = \pi$, where $\pi$ is the vector with $j$th component $\pi_j$. The
goal is to show Pearson's Chi-squared test is asymptotically maximin over an
appropriate (shrinking) set of alternatives $p$ which tend to $\pi$ at rate $n^{-1/2}$. First,
note that the Information matrix $I(p)$ with $(i, j)$ entry $I_{i,j}(p)$ is given by

$$ I_{i,j}(p) = \begin{cases} \dfrac{1}{p_i} + \dfrac{1}{p_{k+1}} & \text{if } j = i \\ \dfrac{1}{p_{k+1}} & \text{otherwise} \end{cases} \tag{14.22} $$

(Problem 14.14). Let $h^T = (h_1, \ldots, h_k)$ and set $h_{k+1} = -\sum_{i=1}^{k} h_i$ so that
$\sum_{i=1}^{k+1} h_i = 0$. Then,

$$ |I^{1/2}(\pi) h|^2 = \sum_{i=1}^{k+1} \frac{h_i^2}{\pi_i} . $$

Theorem 14.3.2 Assume the above multinomial setup.

(i) For any test sequence $\phi_n$ such that $E_\pi(\phi_n) \to \alpha$,

$$ \limsup_{n \to \infty} \inf \left\{ E_{\pi + h n^{-1/2}}(\phi_n) : \sum_{i=1}^{k+1} \frac{h_i^2}{\pi_i} \ge b^2 , \ \pi + h n^{-1/2} \in \Omega \right\} \le P\{\chi^2_k(b^2) > c_{k,1-\alpha}\} . \tag{14.23} $$

(ii) Pearson's Chi-squared test $\phi_n^*$, which rejects when

$$ \sum_{i=1}^{k+1} \frac{(Y_i - n\pi_i)^2}{n\pi_i} > c_{k,1-\alpha} , $$

is asymptotically maximin in the sense that the inequality in (14.23) is an equality
when $\phi_n = \phi_n^*$. Thus, $\phi_n^*$ maximizes

$$ \lim_{n} \inf \left\{ E_{\pi + h n^{-1/2}}(\phi_n) : \sum_{i=1}^{k+1} \frac{h_i^2}{\pi_i} \ge b^2 , \ \pi + h n^{-1/2} \in \Omega \right\} $$

among all tests with asymptotic level $\alpha$.


Proof. Theorem 13.5.4 immediately implies (i). To prove (ii), assume the
opposite. Let $R$ denote the right side of (14.23). Then, there exists a sequence of
alternatives $h^{(n)}$ (with $i$th component denoted $h_i^{(n)}$) satisfying

$$ \sum_{i=1}^{k+1} \frac{[h_i^{(n)}]^2}{\pi_i} \ge b^2 , \qquad \sum_{i=1}^{k+1} h_i^{(n)} = 0 , $$

such that

$$ E_{\pi + h^{(n)} n^{-1/2}}(\phi_n^*) \to \ell $$

and $\ell$ is strictly less than $R$. Since

$$ \sum_{i=1}^{k+1} \frac{[h_i^{(n)}]^2}{\pi_i} \ge b^2 , $$

we cannot have $h_i^{(n)} \to 0$ for every $i$.

We also cannot have $h_i^{(n)} \to \infty$ for any $i$, for then

$$ E_{\pi + h^{(n)} n^{-1/2}}(\phi_n^*) \to 1 , $$

which would be a contradiction since $R < 1$. To see why this expectation would
tend to 1, suppose $h_i^{(n)} \to \infty$ (and a similar argument holds if $h_i^{(n)} \to -\infty$).
Then,

$$ E_{\pi + h^{(n)} n^{-1/2}}(\phi_n^*) \ge P_{\pi + h^{(n)} n^{-1/2}} \left\{ \frac{(Y_i - n\pi_i)^2}{n\pi_i} > c_{k,1-\alpha} \right\} $$

$$ \ge P_{\pi + h^{(n)} n^{-1/2}} \left\{ n^{1/2} \left( \frac{Y_i}{n} - \pi_i \right) > c_{k,1-\alpha}^{1/2} \right\} $$

$$ = P_{\pi + h^{(n)} n^{-1/2}} \left\{ n^{1/2} \left[ \frac{Y_i}{n} - (\pi_i + h_i^{(n)} n^{-1/2}) \right] + h_i^{(n)} > c_{k,1-\alpha}^{1/2} \right\} . \tag{14.24} $$

But, by Chebyshev's inequality,

$$ n^{1/2} \left[ \frac{Y_i}{n} - (\pi_i + h_i^{(n)} n^{-1/2}) \right] $$

is bounded in probability, since it has mean 0 and variance bounded by one.
Hence, (14.24) tends to one and so

$$ E_{\pi + h^{(n)} n^{-1/2}}(\phi_n^*) \to 1 . $$

The same conclusion holds along any subsequence $n_k$ satisfying $h_i^{(n_k)} \to \infty$.

Thus, $h_i^{(n)}$ must remain bounded for every $i$. By passing to subsequences which
converge, assume

$$ h_i^{(n)} \to h_i^{(\infty)} , \quad |h_i^{(\infty)}| < \infty , \quad \text{and} \quad \lambda = \sum_{i=1}^{k+1} \frac{[h_i^{(\infty)}]^2}{\pi_i} \ge b^2 . $$

The limiting power was obtained in Theorem 14.3.1 with $h_i^{(n)} = h_i$ fixed, but
the argument applies with obvious modifications to sequences that converge;
moreover, this limiting power is

$$ P\{\chi^2_k(\lambda) > c_{k,1-\alpha}\} \ge P\{\chi^2_k(b^2) > c_{k,1-\alpha}\} , $$

since the family of noncentral Chi-squared distributions has monotone likelihood ratio. This
again yields a contradiction. The same conclusion holds for any subsequence,
because we can apply the argument to further subsequences where $h_i^{(n)}$ converges
along the subsubsequences. ■

The above result states that the Chi-squared test is asymptotically maximin for
the multinomial goodness of fit problem. The same result holds for the likelihood
ratio test (Problem 14.16). Moreover, the above argument shows that the worst
case power over alternatives $\pi + h n^{-1/2}$ with

$$ \sum_{i=1}^{k+1} \frac{h_i^2}{\pi_i} \ge b^2 $$

occurs (asymptotically) when $\sum_{i=1}^{k+1} h_i^2/\pi_i = b^2$.

14.3.2 Chi-squared Test of Uniformity

So far, we have been concerned with testing the parameters of a multinomial
model. Let us now return to the problem stated at the beginning of Section
14.2, where $X_1, \ldots, X_n$ are i.i.d. real-valued observations with c.d.f. $F$, and the
problem is that of testing the null hypothesis $H$ that $F = F_0$, where $F_0(t) = t$
is the uniform c.d.f. on $(0,1)$. To reduce this problem of goodness of fit to that
of testing a multinomial hypothesis, fix a positive integer $k$ and divide the unit
interval into $k + 1$ subintervals of length $1/(k + 1)$; for $j = 1, \ldots, k + 1$, let $Y_j$ be
the number of observations $X_i$ that fall in the interval $I_{k,j}$ defined by

$$ I_{k,j} = \left[ \frac{j-1}{k+1}, \frac{j}{k+1} \right) . $$

Under the null hypothesis, the joint distribution of $(Y_1, \ldots, Y_{k+1})$ is multinomial
based on $n$ trials and equal class probabilities of $1/(k + 1)$. So, one can test $H$
by using the Chi-squared test which rejects for large values of

$$ \sum_{j=1}^{k+1} \frac{\left( Y_j - \frac{n}{k+1} \right)^2}{\frac{n}{k+1}} . $$

It follows that, for fixed $k$, the Chi-squared test is consistent against any
alternative distribution $F$ which does not assign equal probability to all intervals
$I_{k,j}$.

Next, consider a sequence of alternative densities $f_n$ of the form

$$ f_n(x) = 1 + b_n u(x) , \tag{14.25} $$

where $u$ satisfies $\int_0^1 u(x)\,dx = 0$ and $\int_0^1 u^2(x)\,dx < \infty$. Then, $f_n$ assigns probability

$$ \int_{I_{k,j}} [1 + b_n u(x)]\,dx = \frac{1}{k+1} + b_n \int_{I_{k,j}} u(x)\,dx $$

to $I_{k,j}$. By Theorem 14.3.1 (ii), with $k$ fixed and $b_n = h n^{-1/2}$, the limiting power
of the Chi-squared test is given by

$$ P\{\chi^2_k(\lambda_k) > c_{k,1-\alpha}\} , $$

where

$$ \lambda_k = h^2 (k+1) \sum_{j=1}^{k+1} \left( \int_{I_{k,j}} u(x)\,dx \right)^2 . $$

Note that, if

$$ \int_{I_{k,j}} u(x)\,dx $$

is not zero for at least one $j$, then the noncentrality parameter $\lambda_k$ is positive.
Also, if $u$ is continuous except at most a finite number of points, then

$$ \lambda_k \to \lambda_\infty = h^2 \int_0^1 u^2(x)\,dx \quad \text{as } k \to \infty . \tag{14.26} $$


Note that for any fixed $k$, $\lambda_k$ can be 0 even if $\lambda_\infty > 0$. Indeed, the Chi-squared
test has power equal to the size of the test against any distribution that has mass
$1/(k + 1)$ on each subinterval, and so for fixed $k$, the Chi-squared test is not
consistent against all alternatives.
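To make the remark concrete, the sketch below evaluates $\lambda_k$ for one particular direction, $u(x) = \cos(2\pi x)$, which is an assumption chosen for illustration and does not appear in the text. It integrates to zero over $[0,1]$ and over each half of the unit interval, so $\lambda_1 = 0$, while $\lambda_\infty = h^2/2 > 0$. The function names are hypothetical.

```python
import math


def noncentrality(k, h, antiderivative):
    """lambda_k = h^2 (k+1) sum_j (integral of u over I_{k,j})^2.

    `antiderivative` is a function U with U' = u, so that the integral of u
    over [(j-1)/(k+1), j/(k+1)) equals U(j/(k+1)) - U((j-1)/(k+1)).
    """
    m = k + 1
    total = 0.0
    for j in range(1, m + 1):
        integral = antiderivative(j / m) - antiderivative((j - 1) / m)
        total += integral ** 2
    return h * h * m * total


def cosine_antiderivative(x):
    """Antiderivative of the illustrative direction u(x) = cos(2 pi x)."""
    return math.sin(2.0 * math.pi * x) / (2.0 * math.pi)


lambda_1 = noncentrality(1, 1.0, cosine_antiderivative)      # essentially 0
lambda_200 = noncentrality(200, 1.0, cosine_antiderivative)  # close to 1/2
```

With two cells the test is blind to this direction, whereas a fine partition recovers nearly all of $\lambda_\infty$, in line with (14.26).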

Therefore, it is tempting to allow $k = k_n$ to increase with $n$ in order to obtain
power against an even broader range of alternatives. On the other hand, if $\lambda_k$
approaches $\lambda_\infty$ quite fast, then it would be undesirable to let $k_n$ increase too
quickly. To illustrate this point, consider the following example. Let $u_0(x) = 1$
for $x < 1/2$ and $u_0(x) = -1$ for $x \ge 1/2$. Then, $\lambda_k = \lambda_\infty = h^2$ for all $k$ odd. If
$k = 1$, then the limiting power of the Chi-squared test against $f_n$ given by

$$ f_n(x) = 1 + h n^{-1/2} u_0(x) $$

is

$$ P\{\chi^2_1(h^2) > c_{1,1-\alpha}\} . $$

If instead, $k = 2j + 1$ with $j \ge 1$, the limiting power is exactly

$$ P\{\chi^2_k(h^2) > c_{k,1-\alpha}\} . $$

Notice that the noncentrality parameter is the same for all odd $k$. But, for fixed
$h$, this probability is decreasing in $k$; its limiting value is $\alpha$ as $k \to \infty$, as shown
by the following lemma.

Lemma 14.3.1 Let $M(k, h)$ be defined as

$$ M(k, h) = P\{\chi^2_k(h^2) > c_{k,1-\alpha}\} , \tag{14.27} $$

where $\chi^2_k(h^2)$ denotes a noncentral Chi-squared variable with $k$ degrees of freedom
and noncentrality parameter $h^2$.

(i) For fixed $h$, $M(k, h)$ is nonincreasing in $k$, and is strictly decreasing if $h \ne 0$.

(ii) If $h_k \to h$ for some finite $h$, $M(k, h_k) \to \alpha$ as $k \to \infty$. In particular,
$M(k, h) \to \alpha$ as $k \to \infty$.

(iii) If $(2k)^{-1/2} h_k^2 \to c$ as $k \to \infty$, then

$$ M(k, h_k) \to 1 - \Phi(z_{1-\alpha} - c) . $$

Proof. The proof of (i) is left as an exercise (Problem 14.17). To prove (ii), let
$Z_1, Z_2, \ldots$ denote i.i.d. standard normal variables. By the Central Limit Theorem,

$$ (2k)^{-1/2} \left( \sum_{i=1}^{k} Z_i^2 - k \right) \stackrel{d}{\to} N(0, 1) , \tag{14.28} $$

which implies

$$ (2k)^{-1/2} (c_{k,1-\alpha} - k) \to z_{1-\alpha} \tag{14.29} $$

as $k \to \infty$. Of course, the result (14.28) holds even if the $i = 1$ term is omitted
from the sum. Hence,

$$ M(k, h_k) = P\left\{ (Z_1 + h_k)^2 + \sum_{i=2}^{k} Z_i^2 > c_{k,1-\alpha} \right\} $$

$$ = P\left\{ (2k)^{-1/2} (Z_1 + h_k)^2 + (2k)^{-1/2} \left( \sum_{i=2}^{k} Z_i^2 - k \right) > (2k)^{-1/2} (c_{k,1-\alpha} - k) \right\} . \tag{14.30} $$

By (14.29), the right side of the last expression tends to $z_{1-\alpha}$. Also, as $k \to \infty$,

$$ (2k)^{-1/2} (Z_1 + h_k)^2 \stackrel{P}{\to} 0 . $$

By Slutsky's Theorem, the left side of (14.30) tends in distribution to $N(0, 1)$.
The result (ii) follows by another application of Slutsky's Theorem. The proof of
(iii) is similar. The only difference is that the term

$$ (2k)^{-1/2} (Z_1 + h_k)^2 \stackrel{P}{\to} c $$

if $(2k)^{-1/2} h_k^2 \to c$. ■
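The monotonicity in Lemma 14.3.1 can be examined numerically. The sketch below is an illustration under stated assumptions, not a method from the text: it uses the standard representation of a noncentral Chi-squared variable with noncentrality $h^2$ as a Poisson($h^2/2$) mixture of central Chi-squared variables with $k + 2J$ degrees of freedom, and a recurrence in the degrees of freedom for the central tail probability. All function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """P{chi^2_df > x} for integer df >= 1, stepping the degrees of freedom by 2."""
    if x <= 0:
        return 1.0
    if df % 2 == 1:
        sf, nu = math.erfc(math.sqrt(x / 2.0)), 1   # tail of chi^2_1
    else:
        sf, nu = math.exp(-x / 2.0), 2              # tail of chi^2_2
    while nu < df:
        # sf(x; nu + 2) = sf(x; nu) + (x/2)^(nu/2) exp(-x/2) / Gamma(nu/2 + 1)
        sf += math.exp((nu / 2.0) * math.log(x / 2.0) - x / 2.0
                       - math.lgamma(nu / 2.0 + 1.0))
        nu += 2
    return sf


def chi2_quantile(p, df):
    """Quantile c with P{chi^2_df <= c} = p, found by bisection."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, df) > 1.0 - p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def limiting_power(k, h, alpha=0.05, terms=200):
    """M(k, h) of (14.27), via the Poisson mixture over central chi-squared tails."""
    c = chi2_quantile(1.0 - alpha, k)
    lam = h * h / 2.0
    weight = math.exp(-lam)  # Poisson probability of J = 0
    total = 0.0
    for j in range(terms):
        total += weight * chi2_sf(c, k + 2 * j)
        weight *= lam / (j + 1)
    return total
```

Evaluating `limiting_power(k, 2.0)` for $k = 1, 3, 5, \ldots$ shows the power falling toward $\alpha = 0.05$ as $k$ grows, which is the behavior described in parts (i) and (ii).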

Thus, the results in (i) and (ii) of Lemma 14.3.1 show that the choice $k = 1$ is
optimal for the situation with $u = u_0$. The point is that increasing $k$ too much
decreases the limiting power. Furthermore, if $k$ is quite large, the limiting power
is approximately $\alpha$. This latter conclusion applies to any alternative sequence of
the form (14.25) with $b_n = n^{-1/2}$; also see Problem 14.19.

Mann and Wald (1942) considered the optimal choice of $k_n$. In particular, let

$$ d_K(F, F_0) = \sup_t |F(t) - t| . $$

Mann and Wald (1942) determined an optimal rate for $k_n$ which satisfies $k_n =
O(n^{2/5})$, and showed that with such an optimal rate the limiting power is $1/2 > \alpha$
against a sequence of alternatives $F_n$ satisfying $n^{2/5} d_K(F_n, F_0) \to \infty$. This result
on optimal rates is somewhat contradicted by the above analysis and other results
that indicate that the best choice of $k_n$ is rather small; see Stuart and Ord (1991,
Chapter 30).

It is interesting to compare the results of Mann and Wald with the fact
that the Kolmogorov-Smirnov goodness of fit test has limiting power one if
$n^{1/2} d_K(F_n, F_0) \to \infty$, as shown in Theorem 14.2.2. It follows that the
Kolmogorov-Smirnov test (and this is also true of the Cramér-von Mises test) is
asymptotically superior to the Chi-squared test in this case. However, it has been
pointed out that this superiority is connected with the choice of distance with
which one measures deviations from $F_0$. If one replaces the Kolmogorov-Smirnov
distance with an $L_2$ distance based on the integral of the squared difference
in densities (satisfying smoothness conditions), then the Chi-squared test can
asymptotically outperform the Kolmogorov-Smirnov test; see Ingster (1993). We
will later obtain further results, since Chi-squared tests can be viewed as a special
case of the more general class of Neyman smooth tests that will be studied in
Section 14.4.


14.3.3 Composite Null Hypothesis


Next, we consider the application of the Chi-squared test to composite
hypotheses. First, suppose data $(Y_1, \ldots, Y_{k+1})$ has the multinomial distribution (14.14),
where $Y_j$ is the number of trials resulting in outcome $j$ and $p_j$ is the probability
of the $j$th outcome for any given trial. The full model allows the $p_j$ to vary freely,
subject to their being nonnegative and summing to one.

Consider testing the null hypothesis that the $p_j$ are of the form

$$ p_j = f_j(\beta) , \quad j = 1, \ldots, k + 1 , $$

where the $f_j$ are known functions of $\beta = (\beta_1, \ldots, \beta_q)$, and $\beta$ varies in a subset
of $\mathbb{R}^q$ for some $q < k$. For testing the simple null hypothesis that $p_j = f_j(\beta)$,
$1 \le j \le k$, for a fixed value of $\beta$, the Chi-squared test is based on the statistic

$$ Q_n(\beta) = \sum_{j=1}^{k+1} \frac{(Y_j - n f_j(\beta))^2}{n f_j(\beta)} . \tag{14.31} $$

If $\beta$ is unspecified, Fisher (1928b) suggested the test statistic $Q_n(\hat\beta_n)$, where $\hat\beta_n$
is a MLE of $\beta$ under the null hypothesis submodel (or any efficient estimator).
Following Fisher, Neyman (1949) recommends $Q_n(\tilde\beta_n)$, where $\tilde\beta_n$ is chosen to
minimize $Q_n(\beta)$ (in which case $\tilde\beta_n$ is called a minimum Chi-squared estimator).
Not surprisingly, it is typically the case that, under the null hypothesis,

$$ Q_n(\hat\beta_n) - Q_n(\tilde\beta_n) \stackrel{P}{\to} 0 . $$

Example 14.3.1 (Fisher linkage model) Fisher (1928b) postulated a genetics
model with 4 possible types of offspring, whose probabilities are of the
form

$$ (p_1, p_2, p_3, p_4) = \frac{1}{4} (2 + \beta, 1 - \beta, 1 - \beta, \beta) $$

for some $\beta \in (0,1)$. In the above notation, $f_1(\beta) = (2 + \beta)/4$,
$f_2(\beta) = f_3(\beta) = (1 - \beta)/4$, and $f_4(\beta) = \beta/4$. (The parameter $\beta$ depends on the
linkage between the two genetic factors under consideration.) To test the validity
of such a model, a Chi-squared test can be employed. To estimate $\beta$, it is easily
checked (Problem 14.23) that the likelihood equation is

$$ \frac{Y_1}{2 + \beta} - \frac{Y_2 + Y_3}{1 - \beta} + \frac{Y_4}{\beta} = 0 , \tag{14.32} $$

which reduces to a quadratic equation, and the MLE $\hat\beta_n$ is the root of this
equation that lies in $[0,1]$. The resulting test statistic is then $Q_n(\hat\beta_n)$. ■
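As a worked illustration of Example 14.3.1, clearing denominators in the likelihood equation (14.32) gives the quadratic $n\beta^2 - (Y_1 - 2Y_2 - 2Y_3 - Y_4)\beta - 2Y_4 = 0$ with $n = Y_1 + Y_2 + Y_3 + Y_4$, whose nonnegative root lies in $[0,1]$. The sketch below solves it and evaluates $Q_n(\hat\beta_n)$. The function names and the counts used in the usage example are assumptions for illustration and are not taken from the text.

```python
import math


def linkage_mle(y1, y2, y3, y4):
    """Root in [0, 1] of n b^2 - (y1 - 2 y2 - 2 y3 - y4) b - 2 y4 = 0."""
    n = y1 + y2 + y3 + y4
    b = y1 - 2 * y2 - 2 * y3 - y4
    # The product of the two roots is -2 y4 / n <= 0, so take the larger root.
    return (b + math.sqrt(b * b + 8 * n * y4)) / (2 * n)


def linkage_chi_squared(counts):
    """Return (beta_hat, Q_n(beta_hat)) for the Fisher linkage model.

    Assumes 0 < beta_hat < 1, so that every fitted cell probability is positive.
    """
    y1, y2, y3, y4 = counts
    n = sum(counts)
    beta = linkage_mle(y1, y2, y3, y4)
    probs = [(2 + beta) / 4, (1 - beta) / 4, (1 - beta) / 4, beta / 4]
    q_n = sum((y - n * p) ** 2 / (n * p) for y, p in zip(counts, probs))
    return beta, q_n


# Illustrative counts (an assumption, not data from the text).
beta_hat, q_n = linkage_chi_squared([125, 18, 20, 34])
```

Here $k + 1 = 4$ and $q = 1$, so $Q_n(\hat\beta_n)$ would be compared with a quantile of the Chi-squared distribution with $k - q = 2$ degrees of freedom, for example $-2\log(0.05) \approx 5.99$ at level 0.05.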


Just as in the case of a simple null hypothesis, if the null hypothesis is true, then
(Problem 14.20)

$$ 2 \log(R_n) - Q_n(\hat\beta_n) \stackrel{P}{\to} 0 . \tag{14.33} $$

Thus, under the assumptions of Theorem 12.4.2 (iii), it follows that, under the
null hypothesis,

$$ Q_n(\hat\beta_n) \stackrel{d}{\to} \chi^2_{k-q} . \tag{14.34} $$

As in the case of a simple null hypothesis, the problem of testing a composite
hypothesis of goodness of fit can be reduced to the multinomial case. Suppose
$X_1, \ldots, X_n$ are i.i.d. according to a model $\{P_\theta, \theta \in \Omega\}$, where $\Omega \subset \mathbb{R}^k$. The null
hypothesis specifies $\theta = f(\beta)$ for some fixed function $f$ from $\mathbb{R}^q$ to $\mathbb{R}^k$. Now,
partition the range of the $X_i$ into $k + 1$ sets $E_1, \ldots, E_{k+1}$, and let $P_\theta\{E_i\}$ be the
probability of $E_i$ under $\theta$. Let $Y_j$ denote the number of $X_i$ falling in $E_j$ and let

$$ Q_n(\beta) = \sum_{j=1}^{k+1} \frac{(Y_j - n P_{f(\beta)}\{E_j\})^2}{n P_{f(\beta)}\{E_j\}} . $$

Then, a test can be based on $Q_n(\hat\beta_n)$, where $\hat\beta_n$ is an estimator of $\beta$ assuming
the null hypothesis submodel.

Just as in the case of a simple null hypothesis, the choice of $k$ (and now also
of the sets $E_i$) is complex; note the references in the previous subsection.$^1$ In
addition, a further complication arises, which is the choice of estimator $\hat\beta_n$. If the
estimator is an efficient likelihood estimator based on the likelihood of the
categorized data $Y_1, \ldots, Y_{k+1}$, then we have returned to the setting of the multinomial
case considered at the beginning of this section, and the limiting distribution
of $Q_n(\hat\beta_n)$ is Chi-squared. On the other hand, one might also estimate $\beta$ based
on the likelihood of the original sample $X_1, \ldots, X_n$. In this case, Chernoff and
Lehmann (1954) showed that $Q_n(\hat\beta_n)$ need not be Chi-squared. For an example,
see Problem 14.24.

$^1$ For randomly chosen partitions, see Chapter 2 of Greenwood and Nikulin (1996)
and Theorem 5.7.1 of Lehmann (1999). Data-based partitions occur, for example, when
the number of observations falling in any set is small and one then combines such sets.


14.4 Neyman’s Smooth Tests 

Suppose that $X_1, \ldots, X_n$ are i.i.d. according to a probability distribution $P$ on
some sample space $S$. Consider testing the simple null hypothesis $P = P_0$, where
$P_0$ is some fixed probability distribution on $S$. When $S = \mathbb{R}$, one possible test
is the Kolmogorov-Smirnov test, discussed in Section 14.2, which was seen to
be consistent in power against any fixed alternative, and uniformly consistent
against the large class of distributions $F$ with $d_K(F, F_0) \ge \Delta$ for any small $\Delta$.
Even so, the Kolmogorov-Smirnov test can have poor power against local
alternatives; see Problems 14.6 and 14.7. In fact, whenever the family of alternative
distributions is large, it is unlikely that there will exist a single test that will
perform uniformly well against all of them, and certainly no UMP test
will exist. For a q.m.d. family indexed by a real-valued parameter, one can
construct AUMP tests, as discussed in Section 13.3. However, even if the family of
alternatives is q.m.d. and indexed by a parameter in $\mathbb{R}^2$, there exists no test that
is asymptotically uniformly optimal (Problem 14.25). Thus, one goal might be to
construct tests that perform well across a fairly broad range of alternatives. In
this spirit, Neyman (1937b) considered large parametric families of alternatives
and derived tests that asymptotically maximize minimum (and average) power
against these alternatives. Such tests will be described in this section.

Consider the parametric model of densities $p_\theta(x)$ with respect to $P_0$ given by

$$ p_\theta(x) = C_k(\theta) \exp\left\{ \sum_{j=1}^{k} \theta_j T_j(x) \right\} , \tag{14.35} $$

where $k$ is some positive integer so that $\theta \in \mathbb{R}^k$. Setting $T_0(x) = 1$, the functions
$T_1, \ldots, T_k$ are chosen so that $T_0, \ldots, T_k$ is a set of orthonormal functions on
$L^2(P_0)$, the space of functions that are square integrable with respect to $P_0$; that
is,

$$ \mathrm{Cov}_0[T_i(X_1), T_j(X_1)] = \int_S T_i(x) T_j(x)\, dP_0(x) = \delta_{i,j} , $$

where $\delta_{i,j} = 1$ if $i = j$ and $\delta_{i,j} = 0$ if $i \ne j$. This implies $E_0(T_j) = 0$ for
$j = 1, \ldots, k$. The normalizing constant $C_k(\theta)$ is given by

$$ C_k(\theta) = \left\{ \int_S \exp\left[ \sum_{j=1}^{k} \theta_j T_j(x) \right] dP_0(x) \right\}^{-1} . \tag{14.36} $$

Let $\Omega_k$ denote the set of $\theta$ where the integral in (14.36) is finite so that $p_\theta$ is a
proper density. We will also assume 0 is an interior point of $\Omega_k$, in which case
the family of densities constitutes a $k$-parameter exponential family of full rank.
The null hypothesis asserts $\theta = 0$.

Example 14.4.1 (Testing uniformity using Legendre polynomials) As a
prototype, consider the goodness of fit problem of testing that $X_1, \ldots, X_n$ are
i.i.d. from the uniform distribution on $[0,1]$, so that $S = [0,1]$ and $P_0$ is the
uniform distribution on $[0,1]$. For this problem, Neyman (1937b) chose $T_j(x)$ to
be a polynomial of degree $j$. Specifically, set $T_0(x) = 1$, $T_1(x) = \sqrt{3}(2x - 1)$,
$T_2(x) = \sqrt{5}(6x^2 - 6x + 1)$, $T_3(x) = \sqrt{7}(20x^3 - 30x^2 + 12x - 1)$, and so on, so
that $T_j$ is constructed to be a polynomial of degree $j$ such that it is orthogonal to
$T_0, \ldots, T_{j-1}$, and its square integrates to one. The polynomials $T_j$ are the so-called
normalized Legendre polynomials. ■
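The orthonormality claimed in Example 14.4.1 can be verified exactly with rational arithmetic. In the sketch below (hypothetical names throughout), each polynomial is stored without its normalizing factor $\sqrt{2j+1}$, so orthonormality of the $T_j$ is equivalent to the stored polynomials having pairwise zero inner products and squared norm $1/(2j+1)$ on $[0,1]$.

```python
from fractions import Fraction

# Coefficients (constant term first) of T_j(x) / sqrt(2j + 1) for j = 0, 1, 2, 3.
UNNORMALIZED = [
    [1],
    [-1, 2],
    [1, -6, 6],
    [-1, 12, -30, 20],
]


def integral_of_product(p, q):
    """Exact integral over [0, 1] of the product of two polynomials."""
    total = Fraction(0)
    for a, coef_a in enumerate(p):
        for b, coef_b in enumerate(q):
            # The integral of x^(a+b) over [0, 1] is 1 / (a + b + 1).
            total += Fraction(coef_a * coef_b, a + b + 1)
    return total


def gram_matrix():
    """Matrix of inner products; expected to be diag(1, 1/3, 1/5, 1/7)."""
    return [[integral_of_product(p, q) for q in UNNORMALIZED] for p in UNNORMALIZED]
```

Multiplying the $j$th polynomial by $\sqrt{2j+1}$ then gives unit norm, which matches the factors $\sqrt{3}$, $\sqrt{5}$, $\sqrt{7}$ in the example.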

Returning to the general case, we next derive Neyman's test as a special case of
Rao's score test for testing $\theta = 0$ in the parametric model. The family of densities
(14.35) is a $k$-parameter exponential family in natural form. By Example 12.2.6,
this family is q.m.d. at $\theta = \theta_0 = 0$. By Theorem 12.2.2, the score vector at $\theta_0 = 0$
(12.73) is given by

$$ Z_n = n^{-1/2} \left( \frac{\partial}{\partial\theta_1} \log L_n(\theta), \ldots, \frac{\partial}{\partial\theta_k} \log L_n(\theta) \right) \Big|_{\theta = 0} , $$

where $L_n(\theta)$ is the likelihood function

$$ L_n(\theta) = C_k^n(\theta) \exp\left[ \sum_{i=1}^{n} \sum_{j=1}^{k} \theta_j T_j(X_i) \right] . $$

Hence,

$$ \frac{\partial}{\partial\theta_m} \log[L_n(\theta)] = n \frac{\partial}{\partial\theta_m} \log[C_k(\theta)] + \sum_{i=1}^{n} T_m(X_i) . $$

But, by Problem 2.16,

$$ \frac{\partial}{\partial\theta_m} \log[C_k(\theta)] = -E_\theta[T_m(X_i)] , $$

which is 0 when $\theta = 0$ (since we are assuming $T_0(x) = 1$ and $T_m$ is orthogonal to
$T_0$). Hence, the score vector at $\theta_0$ reduces to

$$ Z_n = \left( n^{-1/2} \sum_{i=1}^{n} T_1(X_i), \ldots, n^{-1/2} \sum_{i=1}^{n} T_k(X_i) \right) . \tag{14.37} $$

By the orthogonality of the $T_j$, we have $\mathrm{Cov}[T_i(X_1), T_j(X_1)] = \delta_{i,j}$. Arguing
directly, the Multivariate Central Limit Theorem implies that, under $\theta = 0$,

$$ Z_n \stackrel{d}{\to} N(0, I_k) , $$

where $I_k$ is the $k \times k$ identity matrix. Moreover, the Fisher Information at $\theta = 0$
is $I(0) = I_k$. Therefore, Rao's score test rejects for large values of

$$ Z_n^T I^{-1}(0) Z_n = Z_n^T Z_n = \sum_{j=1}^{k} Z_{n,j}^2 , $$

where

$$ Z_{n,j} = n^{-1/2} \sum_{i=1}^{n} T_j(X_i) . \tag{14.38} $$

Let $c_{k,1-\alpha}$ be the $1 - \alpha$ quantile of the $\chi^2$-distribution with $k$ degrees of freedom.
By the Continuous Mapping Theorem,

$$ Z_n^T Z_n \stackrel{d}{\to} \chi^2_k $$

and so the test $\phi_n^*$ which rejects when $Z_n^T Z_n > c_{k,1-\alpha}$ is asymptotically consistent
in level. The test $\phi_n^*$ will be referred to as Neyman's smooth test. (Of course, one
can always replace $c_{k,1-\alpha}$ by the exact $1 - \alpha$ quantile of the finite sampling null
distribution of $Z_n^T Z_n$, or the null distribution can be simulated.)


Example 14.4.2 (Continuation of Example 14.4.1) In this case,

$$ Z_{n,1}^2 = \left[ n^{-1/2} \sum_{i=1}^{n} \sqrt{3}(2X_i - 1) \right]^2 = 12 n \left( \bar X_n - \frac{1}{2} \right)^2 . \tag{14.39} $$

Thus, $Z_{n,1}^2$ is large when the sample mean differs greatly from 1/2, the hypothesized
mean. Similarly, $Z_{n,j}^2$ is large when the first $j$ sample moments differ greatly
from those of $U(0,1)$. ■
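The statistic $Z_n^T Z_n$ for testing uniformity is straightforward to compute from (14.38). The sketch below uses the first three normalized Legendre polynomials of Example 14.4.1; the function names and the sample in the usage note are hypothetical.

```python
import math

# Normalized Legendre polynomials on [0, 1] from Example 14.4.1 (j = 1, 2, 3).
LEGENDRE_T = {
    1: lambda x: math.sqrt(3.0) * (2.0 * x - 1.0),
    2: lambda x: math.sqrt(5.0) * (6.0 * x * x - 6.0 * x + 1.0),
    3: lambda x: math.sqrt(7.0) * (20.0 * x ** 3 - 30.0 * x * x + 12.0 * x - 1.0),
}


def neyman_components(xs, k):
    """Components Z_{n,j} = n^{-1/2} sum_i T_j(X_i) for j = 1, ..., k (k <= 3 here)."""
    n = len(xs)
    return [sum(LEGENDRE_T[j](x) for x in xs) / math.sqrt(n) for j in range(1, k + 1)]


def neyman_statistic(xs, k):
    """Neyman's smooth statistic Z_n^T Z_n, the sum of squared components."""
    return sum(z * z for z in neyman_components(xs, k))
```

For a sample such as `[0.1, 0.4, 0.35, 0.8, 0.95]`, `neyman_statistic(xs, 1)` coincides with $12n(\bar X_n - 1/2)^2$, as in (14.39), and the statistic can only grow as further components are added.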

Example 14.4.3 (The $\chi^2$ test) As in Section 14.3, consider the goodness of fit
problem for testing a multinomial distribution with $k + 1$ categories. For
concreteness, suppose $X_1, \ldots, X_n$ are i.i.d., each $X_i$ taking the value $e_j$ with probability
$p_j$, where $e_j$ is the vector with 1 in the $j$th component and 0 in the remaining $k$
components. Then, the chi-squared statistic $Q_n$ given by (14.16) can be viewed
as a Neyman smooth test. Recall $V_n$ given by (14.18) and the matrix $A$ given
by (14.20). Now, let $Z_n$ be the vector $A^{1/2} V_n$, so that $Q_n = Z_n^T Z_n$. Furthermore,
the probability mass function of $X_i$ can be written in the form (14.35)
with $T_j$ satisfying $n^{-1/2} \sum_i T_j(X_i)$ equal to the $j$th component of $Z_n$
(Problem 14.26). (Note, however, that unlike the Legendre polynomials of Example
14.4.1, the functions $T_j$ depend on $k$, so that we really have a triangular array of
orthonormal functions.) ■


14.4.1 Fixed k Asymptotics

Assuming the model (14.35) holds, we can apply Corollary 12.4.1 to conclude
that, under $\theta = h/n^{1/2}$,

$$ Z_n^T Z_n \stackrel{d}{\to} \chi^2_k(|h|^2) . \tag{14.40} $$

We now apply Theorems 13.5.4 and 13.5.5 in order to obtain an asymptotic
maximin property for $\phi_n^*$.

Theorem 14.4.1 Assume the model (14.35) and assume $\theta = 0$ is an interior
point of $\Omega_k$. Consider the problem of testing $\theta = 0$.

(i) For any sequence of tests $\phi_n$ such that $E_0(\phi_n) \to \alpha$ and any $b$ and $B$ satisfying
$0 < b < B < \infty$,

$$ \limsup_{n \to \infty} \inf\{E_{h n^{-1/2}}(\phi_n) : b \le |h| \le B\} \le P\{\chi^2_k(b^2) > c_{k,1-\alpha}\} , \tag{14.41} $$

where $\chi^2_k(b^2)$ is noncentral Chi-squared with $k$ degrees of freedom and
noncentrality parameter $b^2$.

(ii) Neyman's smooth test $\phi_n^*$ is asymptotically maximin in the sense that, for
any $0 < b < B < \infty$,

$$ \inf\{E_{h n^{-1/2}}(\phi_n^*) : b \le |h| \le B\} \to P\{\chi^2_k(b^2) > c_{k,1-\alpha}\} . \tag{14.42} $$

Thus, for any $0 < b < B < \infty$, $\phi_n^*$ maximizes

$$ \lim_{n} \inf\{E_{h n^{-1/2}}(\phi_n) : b \le |h| \le B\} $$

among all tests with asymptotic level $\alpha$.

Proof. Theorem 13.5.4 implies (14.41) and Theorem 13.5.5 implies (14.42). ■

The result (14.41) holds if $B = \infty$ (since the inf over a larger set is bounded
above by the inf over a smaller set). In many cases, one can replace $B$ by $\infty$
in (14.42) as well. For example, suppose $\mathrm{Var}_\theta[T_j(X_1)]$ is a uniformly bounded
function of $\theta$. Then, (14.42) holds if $B = \infty$ (Problem 14.27). This condition is
satisfied, for example, if the $T_j(x)$ are uniformly bounded functions of $x$, as they
are in Neyman's choice of the Legendre polynomials.

Theorem 14.4.1 states an asymptotic maximin property over alternatives $\theta$ that
are $O(n^{-1/2})$ from $\theta = 0$. Of course, Neyman's smooth test is also consistent in
power against any fixed $\theta \ne 0$. Actually, it is consistent in power against a broad
range of alternatives, not just alternatives in the parametric model (14.35).

To make this statement more precise, first consider Neyman's original
construction with $k = 1$ for testing the hypothesis of uniformity, as described in
Example 14.4.1. Then, the test statistic reduces to (14.39). The test statistic is
designed to have power against distributions with mean not equal to 1/2 and it
serves this purpose. For, under an alternative distribution $P$ on $(0,1)$ with mean
$\mu(P) \ne 1/2$, the power of the test which rejects when $Z_{n,1}^2 > c_{1,1-\alpha}$ tends to 1.
To see why, note that by the Weak Law of Large Numbers,

$$ \left( \bar X_n - \frac{1}{2} \right)^2 \stackrel{P}{\to} \left( \mu(P) - \frac{1}{2} \right)^2 > 0 , $$

and so

$$ 12 n \left( \bar X_n - \frac{1}{2} \right)^2 \stackrel{P}{\to} \infty . $$

Therefore, by Slutsky's Theorem,

$$ P\left\{ 12 n \left( \bar X_n - \frac{1}{2} \right)^2 > c_{1,1-\alpha} \right\} \to 1 . $$

The point is that the test will be consistent against any alternative $P$ with mean
$\mu(P) \ne 1/2$, even if $P$ is not a member of the parametric model (14.35).

Similarly, for $k > 1$, Neyman's test will be consistent against any distribution
$P$, as long as the first $k$ moments of $P$ are not identical to the first $k$ moments
of the uniform distribution (Problem 14.28). Thus, Neyman's test for testing
$P = P_0$ has good power across a broader range of distributions than just the
original parametric model (14.35).

Example 14.4.4 (Limiting Power Against a Contiguous Sequence) Consider
a sequence of alternative densities of the form

$$f_n(x) = 1 + b_n u(x) , \tag{14.43}$$

where $b_n \to 0$ and $u$ satisfies

$$\int_0^1 u(x)\,dx = 0 .$$

Assume $\sup_x |u(x)| < \infty$, so that $f_n$ is a density for $b_n$ small enough. If we set
$b_n = hn^{-1/2}$, we can calculate the limiting power of Neyman’s smooth test against
$f_n$ as follows. The family of densities $1 + \theta u(x)$ is q.m.d. at $\theta = 0$ (Problem 12.6)
with score function $n^{-1/2}\sum_i u(X_i)$. If $P_n$ denotes the probability distribution
with density $f_n$ with $b_n = hn^{-1/2}$, then $P_n^n$ is contiguous to $P_0^n$. Under $\theta = 0$,
$(Z_n, n^{-1/2}\sum_i u(X_i))$ is asymptotically multivariate normal. By the multivariate
generalization of Corollary 12.3.2 obtained in Problem 12.33, under $f_n$ with
$b_n = hn^{-1/2}$,

$$Z_n \stackrel{d}{\to} N(c, I_k) ,$$

where $c$ is the vector with $j$th component given by

$$c_j = \mathrm{Cov}\Big(Z_{n,j} ,\ hn^{-1/2}\sum_i u(X_i)\Big) = h\langle T_j, u\rangle ,$$

and

$$\langle T_j, u\rangle = \int_0^1 T_j(x)u(x)\,dx .$$

Hence, under $f_n$,

$$Z_n^\top Z_n \stackrel{d}{\to} \chi_k^2(\delta^2) , \tag{14.44}$$

where

$$\delta^2 = h^2 \sum_{j=1}^{k} \langle T_j, u\rangle^2 .$$

Thus, the limiting power is $M(k, \delta^2)$, with $M(k, h)$ defined by (14.27). Note that
if $u$ is represented as $u(x) = \sum_{j=1}^{\infty} \gamma_j T_j(x)$, then by Parseval’s identity (see A.7),

$$\sum_{j=1}^{\infty} \langle T_j, u\rangle^2 = \int_0^1 u^2(x)\,dx .$$

Thus, Neyman’s test has limiting power exceeding $\alpha$ against alternatives of the
form (14.43) with $b_n \asymp n^{-1/2}$ if $u$ is in the span of $T_1, \ldots, T_k$. ■


14.4.2 Neyman’s Smooth Tests With Large k

In the previous section, Neyman’s smooth test was shown to be an asymptotically 
maximin procedure for the parametric model (14.35) with k fixed. Obviously, the 
larger the value of k, the greater the number of orthogonal directions used to 
construct the test statistic. For fixed k, consistency of Neyman’s smooth test holds 
for a restricted class of alternatives. For example, Neyman’s construction results 
in a test of uniformity that is consistent in power against any distribution that 
does not have the same first k moments as that of the uniform distribution. This 
suggests the possibility that, if we let $k$ increase with $n$, we can obtain consistency
against all distributions because on the unit interval, a distribution is uniquely
determined by its moments; see Feller (1971), Section VII.3. To investigate this
possibility, we now develop some basic properties of the test based on

$$S_{n,k_n} = \sum_{j=1}^{k_n} Z_{n,j}^2 , \tag{14.45}$$

where $k_n$ is some fixed sequence satisfying $k_n \to \infty$.

For fixed $k$, we saw that, under $H_0$,

$$\sum_{j=1}^{k} Z_{n,j}^2 \stackrel{d}{\to} \chi_k^2 .$$

If $k$ is large, the Chi-squared distribution with $k$ degrees of freedom is
approximately $N(k, 2k)$, and so it is reasonable to expect that, under $H_0$,

$$\frac{\sum_{j=1}^{k_n} Z_{n,j}^2 - k_n}{(2k_n)^{1/2}} \stackrel{d}{\to} N(0, 1) .$$


In order to prove this convergence, we need the following lemma, due to Bentkus
(2003), which can be viewed as a multivariate version of the Berry-Esseen
Theorem. In the statement of the result, let $\mathcal{E}_k$ denote the class of Euclidean balls in
$\mathbb{R}^k$; that is, the family of sets $\{y \in \mathbb{R}^k : |x - y| < r\}$ as $x \in \mathbb{R}^k$ and $r > 0$ vary.
Also, let $\mathcal{C}_k$ denote the class of convex sets in $\mathbb{R}^k$.


Lemma 14.4.1 Let $Y_1, Y_2, \ldots, Y_n$ be i.i.d. random vectors in $\mathbb{R}^k$ with mean
vector 0 and $k \times k$ identity covariance matrix $I_k$. Let $\beta = E(|Y_1|^3)$, and let $Z^{(k)}$
denote a multivariate normal random vector with mean 0 and covariance matrix
$I_k$. Then,

$$\sup_{B \in \mathcal{C}_k} \Big| P\Big\{n^{-1/2}\sum_{i=1}^{n} Y_i \in B\Big\} - P\{Z^{(k)} \in B\} \Big| \le 400\, k^{1/4} \beta n^{-1/2} .$$

If $\mathcal{C}_k$ is replaced by $\mathcal{E}_k$, then the right side can be replaced by the upper bound
$C\beta n^{-1/2}$, where $C$ is an absolute constant (independent of $k$). Hence,

$$\sup_{t \in \mathbb{R}} \Big| P\Big\{\Big|n^{-1/2}\sum_{i=1}^{n} Y_i\Big|^2 \le t\Big\} - P\{|Z^{(k)}|^2 \le t\} \Big| \le C\beta n^{-1/2} .$$


We now apply the lemma with

$$Y_i = (T_1(X_i), \ldots, T_k(X_i)) \tag{14.46}$$

so that

$$S_{n,k} = \Big| n^{-1/2} \sum_{i=1}^{n} Y_i \Big|^2 .$$

Note that

$$\beta = E\big( [T_1^2(X_1) + \cdots + T_k^2(X_1)]^{3/2} \big) .$$

By Minkowski’s Inequality (Problem 14.30),

$$\beta^{2/3} \le \sum_{j=1}^{k} \{E[|T_j(X_1)|^3]\}^{2/3} .$$

If

$$\sup_j E[|T_j(X_1)|^3] \le B < \infty , \tag{14.47}$$

then $\beta \le B k^{3/2}$. Hence, the following is true.


Theorem 14.4.2 Consider $S_{n,k_n}$ given by (14.45), where

$$Z_{n,j} = n^{-1/2} \sum_{i=1}^{n} T_j(X_i) ,$$

and let $T_0 = 1$, and $T_0, T_1, T_2, \ldots$ be an infinite sequence of orthonormal functions
on $L^2(P_0)$. Assume

$$\sup_j E_{P_0}[|T_j(X_1)|^3] = B < \infty . \tag{14.48}$$

If $k_n \to \infty$ and $k_n^3/n \to 0$, then, under $P = P_0$,

$$\frac{S_{n,k_n} - k_n}{(2k_n)^{1/2}} \stackrel{d}{\to} N(0, 1) .$$


Proof. Apply the lemma with $Y_i$ given by (14.46). Then,

$$\big| P\{S_{n,k_n} \le t(2k_n)^{1/2} + k_n\} - P\{|Z^{(k_n)}|^2 \le t(2k_n)^{1/2} + k_n\} \big|$$

is bounded above by

$$C B k_n^{3/2} n^{-1/2} \to 0 .$$

But, by the Central Limit Theorem,

$$P\{|Z^{(k_n)}|^2 \le t(2k_n)^{1/2} + k_n\} \to \Phi(t) , \tag{14.49}$$

where $\Phi$ is the standard normal c.d.f., and the result follows. ■

Under the assumptions of Theorem 14.4.2, the sequence of tests that rejects
when

$$\frac{S_{n,k_n} - k_n}{(2k_n)^{1/2}} > z_{1-\alpha} \tag{14.50}$$

is asymptotically level $\alpha$.
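
The test (14.50) is simple to compute once a basis is fixed. The following sketch assumes the cosine functions $T_j(x) = \sqrt{2}\cos(\pi j x)$ for testing uniformity on $(0,1)$; the function names are hypothetical, and the choice of $k$ is left to the caller, since the text only requires $k_n \to \infty$ with $k_n^3/n \to 0$.

```python
import math
from statistics import NormalDist

def cosine_components(xs, k):
    """Z_{n,j} = n^{-1/2} * sum_i sqrt(2) cos(pi j x_i), for j = 1, ..., k."""
    n = len(xs)
    return [
        sum(math.sqrt(2.0) * math.cos(math.pi * j * x) for x in xs) / math.sqrt(n)
        for j in range(1, k + 1)
    ]

def standardized_smooth_statistic(xs, k):
    """(S_{n,k} - k) / (2k)^{1/2}, where S_{n,k} sums the squared components."""
    s = sum(z * z for z in cosine_components(xs, k))
    return (s - k) / math.sqrt(2.0 * k)

def smooth_test_rejects(xs, k, alpha=0.05):
    """Reject when the standardized statistic exceeds z_{1 - alpha}."""
    return standardized_smooth_statistic(xs, k) > NormalDist().inv_cdf(1.0 - alpha)

# A perfectly regular grid gives components equal to zero (up to rounding),
# so the standardized statistic equals -k / (2k)^{1/2} = -(k/2)^{1/2}.
grid = [(i - 0.5) / 16 for i in range(1, 17)]
print(standardized_smooth_statistic(grid, 8), smooth_test_rejects(grid, 8))
```

A sample concentrated at a single point sits at the other extreme: every component is large and the test rejects for any reasonable $k$.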


Example 14.4.5 Let

$$T_j(x) = \sqrt{2}\cos(\pi j x) .$$

Such a choice arises in the construction of the Cramér-von Mises test, which will
be discussed further in Example 14.5.1. Under the null hypothesis $P = P_0 = U(0,1)$,

$$E_{P_0}[|T_j(X_1)|^3] \le \sqrt{2}\, E_{P_0}[T_j^2(X_1)] = \sqrt{2} .$$

Hence, the condition (14.48) is satisfied. ■

Next, we consider the power of (14.50) (with $k_n \to \infty$) against a fixed
alternative. As in Theorem 14.5.1, suppose $P$ is any probability distribution such
that

$$E_P[T_j(X_1)] \ne E_{P_0}[T_j(X_1)]$$

for some $j$. Then, for such a $j$,

$$\frac{Z_{n,j}^2}{n} = \Big[\frac{1}{n}\sum_{i=1}^{n} T_j(X_i)\Big]^2 \stackrel{P}{\to} \{E_P[T_j(X_1)]\}^2 > 0 ,$$

by the Weak Law of Large Numbers. Hence,

$$\frac{S_{n,k_n} - k_n}{(2k_n)^{1/2}} \ge \frac{Z_{n,j}^2 - k_n}{(2k_n)^{1/2}} = \frac{n}{(2k_n)^{1/2}} \Big[\frac{Z_{n,j}^2}{n} - \frac{k_n}{n}\Big] \stackrel{P}{\to} \infty$$

if $k_n/n \to 0$. Hence, the test (14.50) (or the test that rejects if $S_{n,k_n} > c_{k_n,1-\alpha}$)
satisfies

$$P\Big\{\frac{S_{n,k_n} - k_n}{(2k_n)^{1/2}} > z_{1-\alpha}\Big\} \ge P\Big\{\frac{Z_{n,j}^2 - k_n}{(2k_n)^{1/2}} > z_{1-\alpha}\Big\} \to 1$$

and is therefore pointwise consistent in power against $P$.

Note that the condition $k_n/n \to 0$ is a sufficient condition to ensure the test
statistic $[S_{n,k_n} - k_n]/(2k_n)^{1/2}$ tends to $\infty$ in probability under an alternative
$P$. The stronger condition $k_n^3/n \to 0$ is sufficient to show asymptotic normality
under the null hypothesis. These conditions can be weakened, but the message is
that one can obtain consistency against a broad family of distributions by letting
$k$ increase with $n$.

Next, we discuss the limiting power of the test (14.50) against a local sequence
of alternatives. Suppose we consider alternatives of the form (14.35) used in
the construction of Neyman’s smooth tests. Specifically, consider the family of
densities indexed by $\theta_1 \in \mathbb{R}$ given by

$$p_{\theta_1}(x) = C_1(\theta_1)\exp[\theta_1 T_1(x)] .$$

Fix $h > 0$. For testing $\theta_1 = 0$ versus $\theta_1 = hn^{-1/2}$ at level $\alpha$, the limiting power
of an asymptotically most powerful test sequence is $1 - \Phi(z_{1-\alpha} - h)$, by Lemma
13.3.1. This optimal limiting power exceeds $\alpha$ for $h > 0$ and approaches 1 as
$h \to \infty$.
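
The optimal limiting power $1 - \Phi(z_{1-\alpha} - h)$ is easy to tabulate. The sketch below uses only the Python standard library; the function name is an assumption made for illustration.

```python
from statistics import NormalDist

def optimal_local_power(h, alpha=0.05):
    """1 - Phi(z_{1-alpha} - h): limiting power of an asymptotically most
    powerful test of theta_1 = 0 against theta_1 = h n^{-1/2}."""
    std = NormalDist()
    return 1.0 - std.cdf(std.inv_cdf(1.0 - alpha) - h)

# The power equals alpha at h = 0 and increases to 1 as h grows.
for h in (0.0, 1.0, 2.0, 4.0):
    print(h, optimal_local_power(h))
```

This is the benchmark against which the power of Neyman’s test with growing $k$ is compared in what follows.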

Now, consider the limiting power of Neyman’s smooth test with any fixed $k$
against the same sequence of alternatives. By (14.40), if $k$ is fixed, the limiting
power against $hn^{-1/2}$ of the test that rejects when $S_{n,k} > c_{k,1-\alpha}$ is $M(k, h)$ given
by (14.27). Lemma 14.3.1 implies that, for large $k$, the power of the test that
rejects for large $S_{n,k}$ is nearly $\alpha$, against the sequence of alternatives defined by
$\theta_1 = hn^{-1/2}$. In other words, Neyman’s smooth test has poor power against such a
sequence of alternatives, even though this family of alternatives is included in the
original parametric model (14.35) leading to the derivation of the Neyman smooth
tests. Moreover, one can show (Problem 14.32) that, assuming the conditions of
Theorem 14.4.2, under $\theta_1 = hn^{-1/2}$,

$$[S_{n,k_n} - k_n]/(2k_n)^{1/2} \stackrel{d}{\to} N(0, 1) \tag{14.51}$$

as $n, k_n \to \infty$. Thus, the limiting distribution of the normalized $S_{n,k_n}$ is the same
under $\theta_1 = 0$ as under the sequence $\theta_1 = hn^{-1/2}$. Hence, the limiting power is $\alpha$
against either sequence.

In order for the limiting power to be nontrivial against local alternatives, it
is necessary to consider alternatives that converge to $H_0$ at a rate slower than
the usual parametric rate $n^{-1/2}$. For example, let $f_n$ be defined as in (14.43),
but with $b_n$ not of the form $hn^{-1/2}$. By (14.44), if $k$ is fixed, under $f_n$, $S_{n,k}$ is
approximately distributed as $\chi_k^2(\delta_k^2)$, where

$$\delta_k^2 = n b_n^2 \sum_{j=1}^{k} \langle T_j, u\rangle^2 .$$

But,

$$\frac{\chi_k^2(\delta_k^2) - k}{(2k)^{1/2}} \stackrel{d}{\to} N(\mu, 1)$$

if $\delta_k^2/(2k)^{1/2} \to \mu$ as $k \to \infty$. Therefore, one might expect that, under $f_n$,

$$\frac{S_{n,k_n} - k_n}{(2k_n)^{1/2}} \stackrel{d}{\to} N(\mu, 1) \tag{14.52}$$

if

$$\frac{\delta_{k_n}^2}{(2k_n)^{1/2}} \to \mu .$$

Now, if $T_0, T_1, T_2, \ldots$ form a complete orthonormal system for the space of square
integrable functions on $(0,1)$, then,

$$\sum_{j=1}^{\infty} \langle T_j, u\rangle^2 = \int_0^1 u^2(x)\,dx .$$

Therefore, if we take $b_n = (2k_n)^{1/4}/n^{1/2}$, we expect that (14.52) holds, where

$$\mu = \int_0^1 u^2(x)\,dx .$$
In fact, such a result is proved in Eubank and LaRiccia (1992) in the case $T_j(x) =
\sqrt{2}\cos(\pi j x)$ if $k_n/n^2 \to 0$. The conclusion is that Neyman’s test with increasing
order $k_n$ has nonnegligible power against alternatives converging to the null at
rate $k_n^{1/4}/n^{1/2}$. This result suggests that $k_n$ should not increase too quickly.
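
The heuristic behind (14.52) can be checked through the first two moments of the noncentral Chi-squared distribution. The sketch below is an illustration under stated assumptions: it uses the standard facts $E[\chi_k^2(\delta^2)] = k + \delta^2$ and $\mathrm{Var}[\chi_k^2(\delta^2)] = 2k + 4\delta^2$, the value $\mu = 1.5$ is hypothetical, and the function names are not from the text.

```python
import math
from statistics import NormalDist

def standardized_noncentral_moments(k, delta_sq):
    """Mean and variance of (chi2_k(delta_sq) - k) / (2k)^{1/2}.

    Uses E[chi2_k(d)] = k + d and Var[chi2_k(d)] = 2k + 4d.
    """
    mean = delta_sq / math.sqrt(2.0 * k)
    var = (2.0 * k + 4.0 * delta_sq) / (2.0 * k)
    return mean, var

def heuristic_local_power(mu, alpha=0.05):
    """1 - Phi(z_{1-alpha} - mu): the power suggested by an N(mu, 1) limit."""
    std = NormalDist()
    return 1.0 - std.cdf(std.inv_cdf(1.0 - alpha) - mu)

mu = 1.5  # hypothetical value of the integral of u^2 over (0, 1)
for k in (10, 1000, 100000):
    delta_sq = mu * math.sqrt(2.0 * k)  # chosen so delta_sq / (2k)^{1/2} = mu
    print(k, standardized_noncentral_moments(k, delta_sq))
print(heuristic_local_power(mu))
```

The mean stays at $\mu$ while the variance approaches 1 as $k$ grows, which is consistent with the $N(\mu, 1)$ limit; a matching of two moments is of course not a proof of convergence in distribution.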

Further theoretical results concerning Neyman’s smooth tests, especially in
regard to the choice of $k$, can be found in Eubank and LaRiccia (1992),
Ledwina (1994), Kallenberg and Ledwina (1995), Fan (1996) and Inglot and Ledwina
(1996). This growing literature includes simulation studies which show that
Neyman’s smooth tests perform well across a broad range of alternatives and are
competitive with existing tests.


14.5 Weighted Quadratic Test Statistics 

In the construction of Neyman’s smooth tests based on $k$, equal weight was given
to the first $k$ directions determined by the orthonormal functions $T_1, T_2, \ldots$.
Instead, one might consider modifying the test statistic so that different weights
are given to different directions; with such a modification, it becomes possible to
consider an infinite number of directions. Such weighted quadratic test statistics
are considered in this section.

Under the setup and notation of Section 14.4, consider the problem of testing
the simple null hypothesis $H_0 : P = P_0$. Let $T_0 = 1$ and suppose $T_0, T_1, T_2, \ldots$
is an infinite sequence of orthonormal functions on $L^2(P_0)$. Let $Z_{n,j}$ be defined
by (14.38) and consider the test statistic

$$W_n = \sum_{j=1}^{\infty} a_j Z_{n,j}^2 , \tag{14.53}$$

where $a_j$ is a sequence of nonnegative numbers. Typically, we would choose $a_j$
to decrease with $j$, so that less weight is given to the $j$th component making up
$W_n$. Note that $W_n$ is only computable if only finitely many $a_j$ are nonzero, or,
as will be exemplified later, the infinite sum can be explicitly evaluated by an
alternative computable formula.
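
When only finitely many weights are nonzero, $W_n$ can be computed directly from its definition. The following minimal sketch assumes a user-supplied basis function `basis(j, x)` returning $T_j(x)$; that interface, and the cosine basis used in the example, are assumptions made for illustration.

```python
import math

def weighted_quadratic_statistic(xs, weights, basis):
    """W_n = sum_j a_j Z_{n,j}^2 over finitely many weights a_1, ..., a_J.

    basis(j, x) must return T_j(x); weights[j - 1] holds a_j.
    """
    n = len(xs)
    total = 0.0
    for j, a_j in enumerate(weights, start=1):
        z = sum(basis(j, x) for x in xs) / math.sqrt(n)  # Z_{n,j}
        total += a_j * z * z
    return total

def cosine_basis(j, x):
    """T_j(x) = sqrt(2) cos(pi j x), orthonormal under the uniform distribution."""
    return math.sqrt(2.0) * math.cos(math.pi * j * x)

# Equal weights on the first k components recover Neyman's statistic S_{n,k};
# decreasing weights emphasize the low-order directions.
sample = [0.05, 0.10, 0.20, 0.25]
print(weighted_quadratic_statistic(sample, [1.0, 1.0, 1.0], cosine_basis))
print(weighted_quadratic_statistic(sample, [1.0, 0.25, 1.0 / 9.0], cosine_basis))
```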

Let $F_{W_n}$ denote the c.d.f. of $W_n$ under $P_0$, and set

$$w_{n,1-\alpha} = \inf\{x : F_{W_n}(x) \ge 1 - \alpha\} .$$

The following result summarizes some basic properties of W n . 

Theorem 14.5.1 Assume $a_j \ge 0$ and $\sum_j a_j < \infty$.

(i) Under $H_0$, $W_n$ is a well-defined random variable; that is, $W_n < \infty$ with
probability one.

(ii) Under $H_0$,

$$W_n \stackrel{d}{\to} W = \sum_{j=1}^{\infty} a_j Z_j^2 ,$$

where $Z_1, Z_2, \ldots$ are i.i.d. $N(0,1)$ random variables, and $W$ has a continuous
distribution function $F_W$ which is strictly increasing on $(0, \infty)$.

(iii) Let $w_{1-\alpha}$ denote the $1 - \alpha$ quantile of the distribution of $W$, so that

$$F_W(w_{1-\alpha}) = 1 - \alpha .$$

Then, $w_{n,1-\alpha} \to w_{1-\alpha}$.

(iv) Assume $j$ is such that $a_j > 0$. Suppose $P$ is any probability distribution such
that

$$E_P[T_j(X_1)] \ne E_{P_0}[T_j(X_1)] \tag{14.54}$$

(where the expectation on the left side is assumed to exist). Then, the limiting
power of the test that rejects when $W_n > w_{n,1-\alpha}$ against the alternative $P$ is
one. Hence, if all the $a_j$ satisfy $a_j > 0$, then the test is consistent in power against
any $P$ which satisfies (14.54) for some $j$.

Proof. First, note that

$$0 \le E_{P_0}(W_n) = \sum_{j=1}^{\infty} a_j \mathrm{Var}_{P_0}(Z_{n,j}) \le \sum_{j=1}^{\infty} a_j E_{P_0}[T_j^2(X_1)] = \sum_{j=1}^{\infty} a_j < \infty .$$

Part (i) follows, since a nonnegative random variable with a finite mean is finite
with probability one. To prove (ii), first note that $W$ is a well-defined random
variable since

$$E(W) = \sum_{j=1}^{\infty} a_j < \infty .$$

Now, let

$$W^{(k)} = \sum_{j=1}^{k} a_j Z_j^2 .$$

Then, $W^{(k)} \stackrel{P}{\to} W$ as $k \to \infty$. Indeed,

$$0 \le W - W^{(k)} = \sum_{j=k+1}^{\infty} a_j Z_j^2 \stackrel{P}{\to} 0$$

since, by Markov’s Inequality (Problem 11.26), for $\delta > 0$,

$$P\{W - W^{(k)} > \delta\} \le \frac{E(W - W^{(k)})}{\delta} = \frac{\sum_{j=k+1}^{\infty} a_j}{\delta} \to 0$$

as $k \to \infty$. Moreover, the distribution of $W$ is continuous and strictly increasing
(Problem 14.33). To show that $W_n$ converges in distribution to $W$, write

$$W_n = W_n^{(k)} + R_n^{(k)} ,$$

where

$$W_n^{(k)} = \sum_{j=1}^{k} a_j Z_{n,j}^2 .$$

For any fixed $k$, the Multivariate Central Limit Theorem yields

$$(Z_{n,1}, \ldots, Z_{n,k}) \stackrel{d}{\to} (Z_1, \ldots, Z_k) .$$

By the Continuous Mapping Theorem,

$$P\{W_n \le t\} \le P\{W_n^{(k)} \le t\} \to P\{W^{(k)} \le t\} .$$

Therefore, for any $k$,

$$\limsup_n P\{W_n \le t\} \le P\{W^{(k)} \le t\}$$

and so

$$\limsup_n P\{W_n \le t\} \le \lim_{k \to \infty} P\{W^{(k)} \le t\} = P\{W \le t\} . \tag{14.55}$$

Similarly, for any $\delta > 0$,

$$P\{W_n \le t\} \ge P\{W_n \le t,\ R_n^{(k)} \le \delta\} \ge P\{W_n^{(k)} \le t - \delta,\ R_n^{(k)} \le \delta\} .$$

Using the general inequality $P(AB) \ge P(A) - P(B^c)$ yields

$$P\{W_n \le t\} \ge P\{W_n^{(k)} \le t - \delta\} - P\{R_n^{(k)} > \delta\} .$$





But, by Markov’s Inequality,

$$P\{R_n^{(k)} > \delta\} \le \delta^{-1} E(R_n^{(k)}) \le \delta^{-1} \sum_{j=k+1}^{\infty} a_j .$$

Hence, for any $\delta$ and $k$,

$$P\{W_n \le t\} \ge P\{W_n^{(k)} \le t - \delta\} - \delta^{-1} \sum_{j=k+1}^{\infty} a_j$$

and so

$$\liminf_n P\{W_n \le t\} \ge P\{W^{(k)} \le t - \delta\} - \delta^{-1} \sum_{j=k+1}^{\infty} a_j .$$

Now, let $k \to \infty$ to conclude

$$\liminf_n P\{W_n \le t\} \ge P\{W \le t - \delta\} .$$

Letting $\delta \to 0$ and using the continuity of the distribution of $W$, we conclude

$$\liminf_n P\{W_n \le t\} \ge P\{W \le t\} . \tag{14.56}$$

Combining (14.55) and (14.56) yields (ii).

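Part (iii) follows from Lemma 11.2.1. To prove (iv), suppose $j$ is such that

$$E_P[T_j(X_1)] \ne E_{P_0}[T_j(X_1)] .$$

By the Law of Large Numbers,

$$\frac{1}{n}\sum_{i=1}^{n} T_j(X_i) \stackrel{P}{\to} E_P[T_j(X_1)]$$

and so

$$|Z_{n,j}| = \Big| n^{1/2} \cdot \frac{1}{n}\sum_{i=1}^{n} \{T_j(X_i) - E_{P_0}[T_j(X_1)]\} \Big| \stackrel{P}{\to} \infty .$$

Therefore,

$$P\{W_n > w_{n,1-\alpha}\} \ge P\{a_j Z_{n,j}^2 > w_{n,1-\alpha}\} \to 1 . \ ■$$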

Note that the conclusion (iv) holds if the critical value of the test $w_{n,1-\alpha}$ is
replaced by $w_{1-\alpha}$. Using either critical value results in a test that is asymptotically
consistent in level. Of course, one can achieve exact level $\alpha$ if $F_{W_n}$ is not
continuous by rejecting $H_0$ if $W_n > w_{n,1-\alpha}$ and possibly randomizing if $W_n = w_{n,1-\alpha}$.
But, the above result also implies $W_n = w_{n,1-\alpha}$ with probability tending to 0.

Thus, we can conclude that the test that rejects for large $W_n$ is consistent in
power against a broad family of alternatives. Indeed, for a given set of
orthonormal functions $T_1, T_2, \ldots$, let $\Omega_k$ denote the family of densities (14.35) with $k$ fixed.
Let $W_n$ be of the form (14.53) with positive, summable weights $a_j$. Then, the test
that rejects for large $W_n$ is consistent in power against any $P \ne P_0$ in $\bigcup_{k=1}^{\infty} \Omega_k$.
Actually, let $\Omega'_k$ denote the family of distributions $P$ such that

$$E_P[T_k(X_1)] \ne E_{P_0}[T_k(X_1)] .$$

Then, the test is consistent in power against any $P$ in $\bigcup_{k=1}^{\infty} \Omega'_k$. In contrast,
Neyman’s smooth tests are consistent in power against $\Omega_k$ and $\bigcup_{j=1}^{k} \Omega'_j$, where
$k$ is fixed.

For example, for testing uniformity using the normalized Legendre polynomials
$T_1, T_2, \ldots$, the test that rejects for large $W_n$ is consistent in power against any $P$
that is not the uniform distribution, since $P$ and the uniform distribution cannot
have the same sequence of moments.


Example 14.5.1 (The Cramér-von Mises Test) Let $X_1, \ldots, X_n$ be i.i.d.
real-valued random variables with c.d.f. $F$. For testing $F = F_0$, the Cramér-von
Mises statistic is given by

$$C_n = n \int_{-\infty}^{\infty} [F_n(x) - F_0(x)]^2 \, dF_0(x) , \tag{14.57}$$

where $F_n(x)$ is the empirical c.d.f.

The distribution of $C_n$ under $F_0$ is the same for all $F_0$ which are continuous
(Problem 14.34). Hence, we now assume that $F_0(x) = x$. Now, $C_n$ can actually
be represented as a weighted quadratic test statistic $W_n$ with

$$T_j(x) = \sqrt{2}\cos(\pi j x) , \quad j = 1, 2, \ldots$$

and $a_j = 1/(\pi^2 j^2)$. To see this, note that the functions $\sqrt{2}\sin(\pi j x)$, $j = 1, 2, \ldots$
form an orthonormal basis of the space $L^2[0,1]$, the (equivalence class of)
functions that are square integrable on $[0,1]$ (see Section A.3). By Parseval’s formula
(A.7), it follows that

$$C_n = n \sum_{j=1}^{\infty} \Big\{ \int_0^1 [F_n(x) - x] \sqrt{2}\sin(\pi j x)\,dx \Big\}^2 .$$

By integration by parts (Billingsley (1995), Theorem 18.4),

$$\int_0^1 [F_n(x) - x]\sqrt{2}\sin(\pi j x)\,dx = \frac{1}{\pi j}\int_0^1 \sqrt{2}\cos(\pi j x)\,d(F_n(x) - x)$$

$$= \frac{1}{\pi j}\int_0^1 \sqrt{2}\cos(\pi j x)\,dF_n(x) = \frac{1}{\pi j n}\sum_{i=1}^{n} T_j(X_i) = \frac{Z_{n,j}}{\pi j n^{1/2}} .$$

Hence,

$$C_n = \sum_{j=1}^{\infty} \frac{1}{\pi^2 j^2} Z_{n,j}^2 ,$$

as required.

By Theorem 14.5.1, it follows that, under the null hypothesis,

$$C_n \stackrel{d}{\to} \sum_{j=1}^{\infty} \frac{1}{\pi^2 j^2} Z_j^2 ,$$

where $Z_1, Z_2, \ldots$ is a sequence of i.i.d. standard normal random variables. It also
follows that the test is pointwise consistent in power against any alternative c.d.f.
$F$ for which

$$E_F[T_j(X_1)] = \int_0^1 \sqrt{2}\cos(\pi j x)\,dF(x) \ne \int_0^1 \sqrt{2}\cos(\pi j x)\,dF_0(x) = 0$$

for some $j$. But,

$$\int_0^1 \cos(\pi j x)\,dF(x) = 0 \quad \text{for all } j = 1, 2, \ldots$$

implies $F = F_0$ (Problem 14.36), and so the test is pointwise consistent in power
against any $F \ne F_0$. ■
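
The series representation of the Cramér-von Mises statistic can be checked numerically. The sketch below compares a truncated version of $\sum_j Z_{n,j}^2/(\pi^2 j^2)$ with the familiar closed-form expression $C_n = 1/(12n) + \sum_i (X_{(i)} - (2i-1)/(2n))^2$ for $F_0(x) = x$. That closed form is not stated in the text and is used here as an assumption; the function names and the sample values are hypothetical.

```python
import math

def cvm_computing_formula(xs):
    """C_n = 1/(12n) + sum_i (x_(i) - (2i - 1)/(2n))^2, for F_0(x) = x."""
    n = len(xs)
    ordered = sorted(xs)
    return 1.0 / (12.0 * n) + sum(
        (x - (2.0 * i - 1.0) / (2.0 * n)) ** 2
        for i, x in enumerate(ordered, start=1)
    )

def cvm_truncated_series(xs, terms):
    """sum_{j <= terms} Z_{n,j}^2 / (pi^2 j^2) with T_j(x) = sqrt(2) cos(pi j x)."""
    n = len(xs)
    total = 0.0
    for j in range(1, terms + 1):
        z = sum(math.sqrt(2.0) * math.cos(math.pi * j * x) for x in xs) / math.sqrt(n)
        total += z * z / (math.pi ** 2 * j ** 2)
    return total

sample = [0.12, 0.35, 0.41, 0.77, 0.93]
print(cvm_computing_formula(sample))
print(cvm_truncated_series(sample, 5000))
```

Since $|Z_{n,j}| \le (2n)^{1/2}$ for the cosine basis, the neglected tail after $J$ terms is at most of order $2n/(\pi^2 J)$, so the two printed values should agree to about three decimal places.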


Example 14.5.2 (The Anderson-Darling Test) As in Example 14.5.1 for
testing $F(x) = F_0(x) = x$, consider the Anderson-Darling statistic defined by

$$A_n = n \int_0^1 \frac{[F_n(x) - x]^2}{x(1 - x)}\,dx . \tag{14.58}$$

It can be shown (Problem 14.37) that $A_n$ has the form (14.53) of a weighted
quadratic test statistic with

$$a_j = \frac{1}{j(j+1)}$$

and $T_j(x)$ the $j$th normalized Legendre polynomial on $[0,1]$ (used in Neyman’s
original proposal of Neyman’s smooth tests; see Section 14.4). Thus,

$$A_n = \sum_{j=1}^{\infty} \frac{1}{j(j+1)} Z_{n,j}^2 \tag{14.59}$$

(while Neyman’s test corresponds to $\sum_{j=1}^{k} Z_{n,j}^2$). It then follows that, under
$H_0$,

$$A_n \stackrel{d}{\to} \sum_{j=1}^{\infty} \frac{1}{j(j+1)} Z_j^2 .$$

In fact, many test statistics defined by an integral of the form

$$\int U^2(x)\,dx$$

can be rewritten in the form of a weighted quadratic test statistic. A general
treatment of such integral tests of fit can be found in Chapter 5 of Shorack and
Wellner (1986); also, see van der Vaart and Wellner (1996).


Theorem 14.5.1 considered the behavior of a general weighted quadratic test
under the null hypothesis $P = P_0$ and under a fixed alternative. Next, we would
like to consider the behavior of $W_n$ under a sequence of local alternatives $P_n$.

Suppose $P_n$ has density $p_n$ and $P_0$ has density $p_0$ with respect to some common
measure $\mu$. Consider the likelihood ratio based on $n$ i.i.d. observations $X_1, \ldots, X_n$
given by

$$L_n = L_n(X_1, \ldots, X_n) = \frac{\prod_{i=1}^{n} p_n(X_i)}{\prod_{i=1}^{n} p_0(X_i)} .$$




Assume, under $P_0$,

$$\log(L_n) = n^{-1/2}\sum_{i=1}^{n} \eta(X_i) - \frac{\sigma^2}{2} + o_{P_0^n}(1) , \tag{14.60}$$

where

$$E_{P_0}[\eta(X_i)] = 0$$

and

$$0 < E_{P_0}[\eta^2(X_i)] = \sigma^2 < \infty .$$

Then, the Central Limit Theorem implies that, under $P_0$,

$$\log(L_n) \stackrel{d}{\to} N\Big(-\frac{\sigma^2}{2}, \sigma^2\Big)$$

and $\{P_n^n\}$ and $\{P_0^n\}$ are contiguous (by Corollary 12.3.1). Furthermore, under
$P_0$,

$$Z_{n,j} = n^{-1/2}\sum_{i=1}^{n} T_j(X_i) \stackrel{d}{\to} N(0, 1) .$$

By the bivariate Central Limit Theorem, under $P_0$, $(Z_{n,j}, \log(L_n))$ is
asymptotically bivariate normal with asymptotic covariance

$$c_j = \mathrm{Cov}_{P_0}[T_j(X_1), \eta(X_1)] . \tag{14.61}$$

It follows from Corollary 12.3.2 that, under $P_n$,

$$Z_{n,j} \stackrel{d}{\to} N(c_j, 1) .$$

Similarly, for any fixed integer $k$ and constants $\alpha_1, \ldots, \alpha_k$, under $P_0$,

$$\sum_{j=1}^{k} \alpha_j Z_{n,j} = n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{k} \alpha_j T_j(X_i) \stackrel{d}{\to} N\Big(0, \sum_{j=1}^{k} \alpha_j^2\Big)$$

and

$$\Big(\sum_{j=1}^{k} \alpha_j Z_{n,j} ,\ \log(L_n)\Big)$$

is asymptotically bivariate normal with covariance

$$\mathrm{Cov}_{P_0}\Big(\sum_{j=1}^{k} \alpha_j Z_{n,j} ,\ \log(L_n)\Big) = \mathrm{Cov}_{P_0}\Big(\sum_{j=1}^{k} \alpha_j T_j(X_1) ,\ \eta(X_1)\Big) = \sum_{j=1}^{k} \alpha_j c_j .$$

Hence, under $P_n$,

$$\sum_{j=1}^{k} \alpha_j Z_{n,j} \stackrel{d}{\to} N\Big(\sum_{j=1}^{k} \alpha_j c_j ,\ \sum_{j=1}^{k} \alpha_j^2\Big) ,$$

again by Corollary 12.3.2. By the Cramér-Wold device, it follows that, under $P_n$,

$$(Z_{n,1}, \ldots, Z_{n,k}) \stackrel{d}{\to} (Z_1 + c_1, \ldots, Z_k + c_k) , \tag{14.62}$$

where $Z_1, \ldots, Z_k$ are i.i.d. $N(0, 1)$. This suggests that, under $P_n$,

$$W_n \stackrel{d}{\to} \sum_{j=1}^{\infty} a_j (Z_j + c_j)^2 .$$

In fact, the following result is true.

Theorem 14.5.2 Let $W_n$ be defined by (14.53) with $a_j \ge 0$ and

$$\sum_{j=1}^{\infty} a_j < \infty .$$

(i) Assume, based on $n$ i.i.d. observations from $P_n$, for any $k$,

$$(Z_{n,1}, \ldots, Z_{n,k}) \stackrel{d}{\to} (Z_1 + c_1, \ldots, Z_k + c_k) , \tag{14.63}$$

where $Z_1, Z_2, \ldots$ are i.i.d. $N(0, 1)$. If $\sum_j a_j c_j^2 < \infty$, then

$$W_n \stackrel{d}{\to} \sum_{j=1}^{\infty} a_j (Z_j + c_j)^2 . \tag{14.64}$$

(ii) If $P_n$ is such that the likelihood ratio $L_n$ satisfies (14.60), then, under $P_n$,
(14.63) holds with $c_j$ given by (14.61). Furthermore, $\sum_j a_j c_j^2 < \infty$ and so (14.64)
holds as well.

Proof. The proof of (i) is a straightforward generalization of Theorem 14.5.1.
(Note that it can be generalized further in that the $Z_{n,j}$ need not be a normalized
average and the $Z_j$ need not be normal nor independent.) To prove (ii), note that
(14.63) holds by the discussion leading to (14.62). Moreover,

$$\sum_{j=1}^{\infty} a_j c_j^2 = \sum_{j=1}^{\infty} a_j \mathrm{Cov}_{P_0}^2[T_j(X_1), \eta(X_1)]$$

$$\le \sum_{j=1}^{\infty} a_j \mathrm{Var}_{P_0}[T_j(X_1)]\, \mathrm{Var}_{P_0}[\eta(X_1)] = \mathrm{Var}_{P_0}[\eta(X_1)] \cdot \sum_{j=1}^{\infty} a_j < \infty .$$

Hence, the summability condition in (i) holds. ■


Example 14.5.3 (Limiting Power Calculation) As in Example 14.4.4, let
$f_n(x)$ be given by (14.43) with $b_n = hn^{-1/2}$. As noted in Example 14.4.4, under
$f_n$,

$$(Z_{n,1}, \ldots, Z_{n,k})^\top \stackrel{d}{\to} N(c, I_k) ,$$

where $c$ has $j$th component $c_j = h\langle T_j, u\rangle$. Note that

$$\sum_{j=1}^{\infty} a_j c_j^2 \le h^2 \int_0^1 u^2(x)\,dx \sum_{j=1}^{\infty} a_j < \infty .$$

Therefore, by Theorem 14.5.2, (14.64) holds. ■

Assume the hypothesis in Theorem 14.5.2 (ii). Let $w_{1-\alpha}$ be the $1 - \alpha$ quantile of
the limit distribution under the null hypothesis. Then, the limiting power against
$P_n$ is given by

$$P\Big\{\sum_{j=1}^{\infty} a_j (Z_j + c_j)^2 > w_{1-\alpha}\Big\} . \tag{14.65}$$

If there exists a nonzero $c_j$ for which $a_j > 0$, then (14.65) exceeds $\alpha$ (Problem
14.41). For example, if $a_j > 0$ for all $j$, then the requirement is that there exists
some $j$ for which $c_j$ is nonzero. But, this must be the case if $1, T_1, T_2, \ldots$ form an
orthonormal basis for $L^2(P_0)$, because Parseval’s identity implies

$$0 < \mathrm{Var}_{P_0}[\eta(X_1)] = \sum_{j=1}^{\infty} c_j^2 .$$

It follows that not all $c_j$ can be 0.

Thus, unlike Neyman’s smooth test with $k_n \to \infty$, the limiting power for $W_n$
is nontrivial against certain contiguous alternatives, and so it appears that tests
based on $W_n$ are better at detecting alternatives that are close to $H_0$. However, we
now show that the limiting power of $W_n$ can be $\alpha$ against a contiguous sequence
of alternatives.

Example 14.5.4 (Another Local Power Calculation) Let

$$T_j(x) = \sqrt{2}\cos(\pi j x) .$$

Set $p_\theta(x) = C(\theta)\exp[\theta T_B(x)]$. If $B$ is fixed and large, the limiting distribution of
$W_n$ against $\theta = hn^{-1/2}$ differs from the null limiting distribution only through the
term $a_B(Z_B + h)^2$, which replaces $a_B Z_B^2$. Since $a_B \to 0$ as $B \to \infty$, it follows that

$$a_B(Z_B + h)^2 \stackrel{P}{\to} 0$$

as $B \to \infty$. Therefore, the limiting power against such a sequence is small. In
order to obtain a limiting value of $\alpha$, let

$$f_n(x) = C_n(\theta)\exp[\theta T_n(x)] . \tag{14.66}$$

Then, if $\theta = hn^{-1/2}$, the limiting power of the test based on $W_n$ against such a
sequence is $\alpha$, even though $P_n^n$ is contiguous to $P_0^n$, where $P_n$ is the distribution
with density $f_n$ when $\theta = hn^{-1/2}$ (Problem 14.39). ■
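
To see how quickly the contribution of a high-frequency direction disappears, one can compute the shift in the mean of the limiting distribution, $E[a_B(Z_B + h)^2] - E[a_B Z_B^2] = a_B h^2$. The sketch below assumes the Cramér-von Mises weights $a_j = 1/(\pi^2 j^2)$ purely for illustration; the example in the text allows general weights, and the function names are hypothetical.

```python
import math

def cvm_weight(j):
    """a_j = 1 / (pi^2 j^2), the Cramer-von Mises weights (an assumed choice)."""
    return 1.0 / (math.pi ** 2 * j ** 2)

def mean_shift(h, B):
    """E[a_B (Z_B + h)^2] - E[a_B Z_B^2] = a_B h^2 for Z_B ~ N(0, 1)."""
    return cvm_weight(B) * h * h

# Even for a fairly large local parameter h, the shift is negligible once
# the alternative lives in a high-frequency direction B.
for B in (1, 10, 100):
    print(B, mean_shift(3.0, B))
```

The mean shift alone does not determine the power, but it indicates why the limiting distribution under such alternatives is nearly indistinguishable from the null limit.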

A difficulty in applying a weighted quadratic test statistic is the computation of
critical values and power. Of course, one may resort to Monte Carlo simulation of
the null distribution. Alternatively, the representation of the limiting distribution
as that of

$$W = \sum_{j=1}^{\infty} a_j (Z_j + c_j)^2 \tag{14.67}$$

can be useful. For example, the null distribution (in the case $c_j = 0$) has
characteristic function

$$\zeta_W(t) = \prod_{j=1}^{\infty} (1 - 2 i a_j t)^{-1/2}$$

(Problem 14.40). In the special case of the Cramér-von Mises test, Smirnov
inverted (see Durbin (1973)) and obtained

$$P\{W > x\} = \frac{1}{\pi}\sum_{j=1}^{\infty} (-1)^{j+1} \int_{(2j-1)^2\pi^2}^{4j^2\pi^2} \frac{1}{y} \sqrt{\frac{-\sqrt{y}}{\sin(\sqrt{y})}}\, \exp\Big(-\frac{xy}{2}\Big)\,dy .$$

Alternatively, one may truncate the series (14.67) to a finite sum and use
numerical methods; see Durbin and Knott (1972). Another possibility is to match
moments of $W$ to a Pearson family of distributions, as done by Stephens (1976).
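
Moment matching requires the cumulants of the null limit $W = \sum_j a_j Z_j^2$, which follow from the weights alone: since the $r$th cumulant of $Z_j^2$ is $2^{r-1}(r-1)!$ and cumulants of independent summands add, the $r$th cumulant of $W$ is $2^{r-1}(r-1)!\sum_j a_j^r$. The sketch below evaluates truncated versions of these sums; the truncation point and function name are arbitrary choices made for illustration, and the reference values quoted in the comments are the closed forms of the corresponding infinite series.

```python
import math

def limit_cumulant(weights, r):
    """r-th cumulant of W = sum_j a_j Z_j^2 with Z_j i.i.d. N(0, 1).

    Each a_j Z_j^2 contributes 2^(r-1) (r-1)! a_j^r, and cumulants of
    independent summands add.
    """
    return 2 ** (r - 1) * math.factorial(r - 1) * sum(a ** r for a in weights)

TERMS = 100000  # truncation point (an arbitrary choice for this sketch)
cvm = [1.0 / (math.pi ** 2 * j ** 2) for j in range(1, TERMS + 1)]
ad = [1.0 / (j * (j + 1.0)) for j in range(1, TERMS + 1)]

# Cramer-von Mises limit: mean near 1/6 and variance near 1/45.
print(limit_cumulant(cvm, 1), limit_cumulant(cvm, 2))
# Anderson-Darling limit: mean near 1 and variance near 2 (pi^2 - 9) / 3.
print(limit_cumulant(ad, 1), limit_cumulant(ad, 2))
```

The first two cumulants are the mean and variance; higher cumulants, obtained with larger `r`, would be needed to fit a member of a Pearson family.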

Some numerical power comparisons between competing goodness of fit tests
can be found in Durbin and Knott (1972) and Stephens (1974), where both the
Anderson-Darling and Cramér-von Mises statistics outperform the
Kolmogorov-Smirnov test. A further comparison is presented in D’Agostino and Stephens
(1986), Section 8.14. However, Example 14.5.4 shows that tests based on weighted
quadratic statistics $W_n$ can have poor power against higher frequency
alternatives, such as (14.66). In the case of the Cramér-von Mises statistic and the
Anderson-Darling statistic, this can be explained by the rapid downweighting of
the $a_j$. Moreover, several simulation studies have demonstrated that Neyman’s
smooth tests can outperform tests based on $W_n$ over a wide range of
alternatives; see Miller and Quesenberry (1979), Rayner and Best (1989) and Eubank
and LaRiccia (1992). In summary, both Neyman’s smooth tests and weighted
quadratic tests offer viable approaches to testing goodness of fit, but neither
approach is asymptotically uniformly optimal. Unfortunately, we will see in the
next section that no test can perform uniformly well against local or contiguous
alternatives when the family of possible alternatives is large.


14.6 Global Behavior of Power Functions 

For testing uniformity, the Kolmogorov-Smirnov and the weighted quadratic tests
such as the Cramér-von Mises test are consistent in power against any alternative.
Even the Chi-squared test with a finite number of partitions and the Neyman
smooth tests with finite $k$ are consistent in power against a broad range of
alternatives. However, as we will see in this section, the power of any goodness of fit
test is poor against a local sequence of (contiguous) alternatives, except
possibly in a finite (bounded) number of directions, even with increasing sample size.
Such a statement is not surprising for Neyman’s smooth tests with $k$ fixed, since
then only a finite number of orthogonal directions are used. While a quadratic
test statistic gives positive weight to infinitely many components, the weights $a_j$
satisfy $\sum_j a_j < \infty$; this condition evidently entails

$$\sum_{j=k+1}^{\infty} a_j < \epsilon$$

for large enough $k$, so that the test essentially only uses a finite number of
directions as well; roughly, the test behaves similarly to the corresponding test obtained
by summing over only the first $k$ components. (For a rigorous statement, see
Milbrodt and Strasser (1990, Remark 2.6) and Janssen (1995).) Thus, while
consistency may hold against any fixed alternative as $n \to \infty$, there remains the
possibility that, for any fixed sample size $n$, any test will perform poorly against
a broad range of alternatives. Moreover, one cannot simply increase $k$ to obtain
power against a broader family of distributions. As we saw in the case of the
Chi-squared test of uniformity with $k + 1$ cells, while increasing $k$ increases the set
of consistent alternatives, it will decrease the limiting power against contiguous
alternatives. Roughly speaking, we will see that one can only obtain reasonable
power locally across a family of distributions of fixed bounded dimension.

In order to make this precise, first consider the following normal model, which
arises as the limiting experiment for testing goodness of fit in Section 14.4. The
argument leading to the optimality result (14.42) was based on the fact that,
for the parametric model $P_\theta$ of densities $p_\theta$ given by (14.35), the experiment
$\{P^n_{hn^{-1/2}}\}$ is (locally) asymptotically normal at $\theta_0 = 0$, where the limit
experiment $\{Q_h\}$ consists of observing $Z^\top = (Z_1, \ldots, Z_k)$ and the $Z_i$ are independent
with $Z_i \sim N(h_i, 1)$. In this model, for testing $h = 0$ against $|h| \ge b$, the
maximin test rejects when $\sum_{i=1}^{k} Z_i^2 > c_{k,1-\alpha}$. The maximin power of this test over
alternatives $|h| \ge b$ is given by the right side of (14.42), which is denoted by

$$M(k, b) = P\{\chi_k^2(b^2) > c_{k,1-\alpha}\} .$$

By Lemma 14.3.1, $M(k, b) \to \alpha$ as $k \to \infty$. Thus, in the limiting normal
experiment with $k$ large, one cannot test $h = 0$ against $|h| \ge b$ uniformly well in all
directions. To put this another way, consider the $r_k$-dimensional subspace $V_k$
of $\mathbb{R}^k$ which, without loss of generality, we take to be spanned by the first $r_k$
axes of the original $k$-dimensional space. Then, the maximin power against
alternatives in $V_k$ with $\sum_{i=1}^{r_k} h_i^2 = b^2$ is attained by $h_1 = \cdots = h_{r_k} = b/r_k^{1/2}$ and
$h_{r_k+1} = \cdots = h_k = 0$. The same argument used in Lemma 14.3.1 shows that the
maximum power will tend to $\alpha$ if $r_k \to \infty$. Therefore, in order for the power to
be bounded away from $\alpha$ as $k \to \infty$, we must require $r_k$ bounded as $k \to \infty$.
Thus, one cannot expect to construct tests with high power, except possibly in a
finite-dimensional subspace. This point was made clear by Janssen (2000a), who
provided more specific bounds on the dimension of the subspace. We now develop
his results.

Lemma 14.6.1 Suppose $Z_1, \ldots, Z_k$ are independent with $Z_i$ distributed as
$N(h_i, 1)$. Here, the parameter $(h_1, \ldots, h_k)$ varies in $\mathbb{R}^k$. Consider testing the
null hypothesis that $h_i = 0$ for all $i$, against the alternative that not all the $h_i$
are 0. Let $\phi = \phi(Z_1, \ldots, Z_k)$ be any test with $E_0(\phi) = \alpha$. Define $e_i$ to be the unit
vector in $\mathbb{R}^k$ with 1 in the $i$th component and 0 in the other components. Then,
for each $H > 0$,

$$\sum_{i=1}^{k} \big[\sup\{|E_{te_i}(\phi) - \alpha| : |t| \le H\}\big]^2 \le \alpha(1 - \alpha)(\exp(H^2) - 1) . \tag{14.68}$$

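Proof. The function

$$g_i(t) = |E_{te_i}(\phi) - \alpha|$$

is continuous on $t \in [-H, H]$, and so it attains its maximum at some point $t_i$.
Let

$$Y_i = \exp\Big(t_i Z_i - \frac{t_i^2}{2}\Big) - 1 .$$

Using the fact that $E[\exp(tZ)] = \exp(t^2/2)$ if $Z$ is $N(0, 1)$ yields $E_0(Y_i) = 0$ and

$$\mathrm{Var}_0(Y_i) = \exp(t_i^2) - 1 \le \exp(H^2) - 1 .$$

Let $\varphi$ denote the standard normal density. Then, the point of introducing the $Y_i$
is that

$$E_{t_i e_i}[\phi(Z_1, \ldots, Z_k)] - \alpha = \int \phi(z_1, \ldots, z_k)\,\varphi(z_i - t_i) \prod_{j \ne i} \varphi(z_j) \prod_{l=1}^{k} dz_l - \int \phi(z_1, \ldots, z_k) \prod_{l=1}^{k} \varphi(z_l)\,dz_l$$

$$= \int \phi(z_1, \ldots, z_k)\Big[\exp\Big(t_i z_i - \frac{t_i^2}{2}\Big) - 1\Big] \prod_{l=1}^{k} \varphi(z_l)\,dz_l = E_0[\phi(Z_1, \ldots, Z_k) Y_i]$$

and so

$$E_{t_i e_i}[\phi(Z_1, \ldots, Z_k)] - \alpha = \mathrm{Cov}_0(\phi, Y_i) .$$

Define

$$\beta_i = \begin{cases} \dfrac{\mathrm{Cov}_0(\phi, Y_i)}{\mathrm{Var}_0(Y_i)} & \text{if } \mathrm{Var}_0(Y_i) > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{14.69}$$

Note that, if $t_i \ne 0$, then $\mathrm{Var}_0(Y_i) > 0$; if $t_i = 0$, then $Y_i = 0$ and $\beta_i = 0$. Define
$\tilde\phi$ by the relation

$$\phi(Z_1, \ldots, Z_k) - \alpha = \sum_{i=1}^{k} \beta_i Y_i + \tilde\phi ,$$

so that

$$E_0(\tilde\phi) = 0 , \quad E_0(\tilde\phi^2) < \infty$$

and

$$\mathrm{Cov}_0(\tilde\phi, Y_i) = 0 , \quad i = 1, \ldots, k .$$

This implies $\tilde\phi$ is uncorrelated with $\phi - \tilde\phi$, and so

$$\mathrm{Var}_0(\phi) = \mathrm{Var}_0(\tilde\phi + \phi - \tilde\phi) = \mathrm{Var}_0(\tilde\phi) + \mathrm{Var}_0(\phi - \tilde\phi) .$$

Therefore,

$$\mathrm{Var}_0(\phi - \tilde\phi) \le \mathrm{Var}_0(\phi) = E_0(\phi^2) - \alpha^2 \le \alpha(1 - \alpha) .$$

Also,

$$\sum_{i=1}^{k} \beta_i^2 \mathrm{Var}_0(Y_i) = \mathrm{Var}_0\Big(\sum_{i=1}^{k} \beta_i Y_i\Big) = \mathrm{Var}_0(\phi - \tilde\phi) \le \alpha(1 - \alpha) . \tag{14.70}$$

But,

$$E_{t_i e_i}(\phi) - \alpha = \beta_i \mathrm{Var}_0(Y_i)$$

implies

$$|E_{t_i e_i}(\phi) - \alpha|^2 \le \beta_i^2 \mathrm{Var}_0(Y_i) \cdot \mathrm{Var}_0(Y_i) \le \beta_i^2 \mathrm{Var}_0(Y_i)(\exp(H^2) - 1) .$$

Summing over $i$ and using the bound (14.70) yields the result. ■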

Notice that the bound on the right side of (14.68) does not depend on $k$, the
dimension of the parameter space. In fact, the same bound holds for tests based
on an infinite sequence $Z_1, Z_2, \ldots$. In order to avoid certain technical aspects of
likelihoods on infinite product spaces, we restrict attention to the case of $k$ finite.
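
As a numerical illustration of (14.68), the following sketch evaluates both sides of the bound for one particular test that is not taken from the text: the one-sided test that rejects when $Z_1 > z_{1-\alpha}$ and ignores the remaining coordinates. The function names are hypothetical and chosen only for this example. For this test, only the direction $e_1$ contributes to the left side, so the sum does not depend on $k$.

```python
from math import exp
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def max_power_deviation(alpha: float, H: float) -> float:
    """sup over |t| <= H of |E_{t e_1}(phi) - alpha| for the one-sided z-test.

    The test rejects when Z_1 > z_{1-alpha}; its power at t * e_1 is
    1 - Phi(z_{1-alpha} - t). This power is monotone in t, so the supremum
    of the absolute deviation is attained at an endpoint t = H or t = -H.
    """
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    power_at = lambda t: 1.0 - STD_NORMAL.cdf(z - t)
    return max(abs(power_at(H) - alpha), abs(power_at(-H) - alpha))


def total_squared_deviation(alpha: float, H: float) -> float:
    """Left side of (14.68) for this test.

    Directions e_2, ..., e_k contribute 0 because the test ignores
    Z_2, ..., Z_k, so only the e_1 term remains.
    """
    return max_power_deviation(alpha, H) ** 2


def lemma_bound(alpha: float, H: float) -> float:
    """Right side of (14.68): alpha (1 - alpha) (exp(H^2) - 1)."""
    return alpha * (1.0 - alpha) * (exp(H * H) - 1.0)


if __name__ == "__main__":
    for H in (0.5, 1.0, 2.0):
        print(H, total_squared_deviation(0.05, H), lemma_bound(0.05, H))
```

The test concentrates all of its power in a single direction, and even so the squared deviation stays below the bound for each $H$ tried here.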

We now use the previous lemma to show that, for the normal testing problem
studied in Lemma 14.6.1, the power of any level $\alpha$ test is poor, except possibly
on a restricted range of alternatives. Thus, for fixed large $k$, it is impossible to
construct a test that has high power in all directions (which certainly implies the
same conclusion for any larger $k$ or when $k = \infty$). The following notation will be
used. For a set $V$ in $\mathbb{R}^k$, let $V^\perp$ be defined as

$$V^\perp = \{x : \langle x, v\rangle = 0 \text{ for all } v \in V\} .$$


Theorem 14.6.1 Suppose $Z_1,\ldots,Z_k$ are independent, with $Z_i$ normally distributed
with mean $h_i$ and variance one. The parameter $h = (h_1,\ldots,h_k)^T$ varies
freely in $\mathbb{R}^k$. For testing $h = 0$ versus $h \ne 0$, let $\phi = \phi(Z_1,\ldots,Z_k)$ be any test
with $E_0(\phi) = \alpha$. Fix any $\epsilon > 0$ and any $H > 0$. Assume

$$k > 1 + \epsilon^{-1}\alpha(1-\alpha)[\exp(H^2) - 1] . \tag{14.71}$$

Then, there exists a linear subspace $V$, whose dimension $d$ is independent of $k$
and $\phi$, such that

$$\sup\{|E_h(\phi) - \alpha| : h \in V^\perp ,\ |h| \le H\} \le \epsilon \tag{14.72}$$

and

$$d \le 1 + \epsilon^{-1}\alpha(1-\alpha)[\exp(H^2) - 1] . \tag{14.73}$$

In words, the power of $\phi$ is poor on $\{h \in V^\perp : |h| \le H\}$.

Proof. Let $V_0 = \{0\}$. We will inductively choose linear subspaces $V_n =
\mathrm{span}\{v_1,\ldots,v_n\}$ of $\mathbb{R}^k$ as follows. Given $v_1,\ldots,v_n$, let $v_{n+1}$ be orthogonal to
$v_1,\ldots,v_n$ and satisfy $|v_{n+1}| = 1$ and

$$\Bigl[\sup\{|E_{tv}(\phi) - \alpha| : |t| \le H ,\ v \in V_n^\perp ,\ |v| = 1\}\Bigr]^2 \le |E_{t_{n+1}v_{n+1}}(\phi) - \alpha|^2 + \frac{\epsilon}{2^{n+1}} .$$

Let $b_{n+1} = |E_{t_{n+1}v_{n+1}}(\phi) - \alpha|^2$. Choose $m$ to be the smallest positive integer
satisfying

$$b_m + \frac{\epsilon}{2^m} \le \epsilon . \tag{14.74}$$

To see that such an $m$ exists and $m \le k$, note that Lemma 14.6.1 implies (possibly
after an orthogonal transformation) that

$$\sum_{n=1}^k \Bigl(b_n + \frac{\epsilon}{2^n}\Bigr) \le \alpha(1-\alpha)[\exp(H^2) - 1] + \epsilon .$$

But, the assumption on $k$ implies

$$\frac{\alpha(1-\alpha)[\exp(H^2) - 1]}{\epsilon k} + \frac{1}{k} < 1 ,$$

which implies

$$\frac{1}{k}\sum_{n=1}^k \Bigl(b_n + \frac{\epsilon}{2^n}\Bigr) \le \frac{\alpha(1-\alpha)[\exp(H^2) - 1]}{k} + \frac{\epsilon}{k} < \epsilon .$$

Hence, there exists such an $m$ with $m \le k$. Let $V$ in the statement of the theorem
be $V_{m-1}$. Then (14.72) is satisfied because $m$ satisfies (14.74). Moreover, since

$$b_j + \frac{\epsilon}{2^j} > \epsilon \quad \text{for } j = 1,\ldots,m-1 ,$$

we have

$$(m-1)\epsilon < \sum_{j=1}^{m-1}\Bigl(b_j + \frac{\epsilon}{2^j}\Bigr) \le \alpha(1-\alpha)[\exp(H^2) - 1] + \epsilon ,$$

where the last inequality follows from Lemma 14.6.1. Therefore,

$$m - 1 < 1 + \epsilon^{-1}\alpha(1-\alpha)[\exp(H^2) - 1] . \ ■$$


The point of Lemma 14.6.1 and Theorem 14.6.1 is that one cannot have high
power uniformly in all orthonormal directions. This is not particularly surprising
given that there are $k$ observations and $k$ parameters. Nevertheless, the statistician
must then implicitly or explicitly construct a test so that the power is high
in certain important directions.
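
To get a feel for the size of the bound (14.73), the following sketch simply evaluates its right side, together with the smallest $k$ allowed by (14.71). The function names are hypothetical and the parameter values in the usage note are arbitrary choices made for illustration.

```python
from math import exp, floor


def dimension_bound(alpha: float, eps: float, H: float) -> float:
    """Right side of (14.73): 1 + alpha (1 - alpha) [exp(H^2) - 1] / eps."""
    return 1.0 + alpha * (1.0 - alpha) * (exp(H * H) - 1.0) / eps


def smallest_admissible_k(alpha: float, eps: float, H: float) -> int:
    """Smallest integer k satisfying the strict inequality (14.71)."""
    return floor(dimension_bound(alpha, eps, H)) + 1


if __name__ == "__main__":
    for H in (1.0, 2.0, 3.0):
        print(H, dimension_bound(0.05, 0.1, H), smallest_admissible_k(0.05, 0.1, H))
```

With $\alpha = 0.05$, $\epsilon = 0.1$ and $H = 2$, the bound is about 26.5, so the exceptional subspace has dimension at most 26 no matter how large $k$ is. The bound grows like $\exp(H^2)$, so it is informative mainly for moderate $H$.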

We can obtain analogous results for the problem of testing $P = P_0$ based on $n$
i.i.d. observations from $P$. Even with increasing $n$, the total amount of squared
power greater than $\alpha$ of any test (sequence) is bounded.


Theorem 14.6.2 Let $X_1,\ldots,X_n$ be i.i.d. $P_\theta$, where $P_\theta$ has density $p_\theta$ given by
(14.35) with $\theta \in \mathbb{R}^k$. For testing $\theta = 0$ versus $\theta \ne 0$, let $\phi_n = \phi_n(X_1,\ldots,X_n)$
be any level $\alpha$ test. Fix $\epsilon > 0$ and $H > 0$, and assume $k$ satisfies (14.71). Then,

(i)

$$\limsup_n \sum_{i=1}^k \Bigl[\sup\{|E_{te_in^{-1/2}}(\phi_n) - \alpha| : |t| \le H\}\Bigr]^2 \le \alpha(1-\alpha)[\exp(H^2) - 1] . \tag{14.75}$$

(ii) There exists a subspace $V$ of $\mathbb{R}^k$ whose dimension $d$ satisfies (14.73)
(independent of $k$) such that

$$\limsup_n \sup\{|E_{hn^{-1/2}}(\phi_n) - \alpha| : h \in V^\perp ,\ |h| \le H\} \le \epsilon . \tag{14.76}$$

Proof. The sequence of models $\{P^n_{hn^{-1/2}}\}$ is asymptotically normal with identity
covariance matrix $I_k$, in the sense of Definition 13.4.1. Indeed, the family is an
exponential family and hence is quadratic mean differentiable. In fact, as previously
pointed out, the score vector for this model is given by (14.37) and is asymptotically
multivariate normal with mean 0 and identity covariance matrix. The proof
then follows from Theorem 13.4.1, which compares the limiting power of any test
sequence with that of a test for the normal model studied in Lemma 14.6.1. For
the limiting normal experiment, an upper bound for the sum of squared powers
is given in Lemma 14.6.1, and so this bound must hold asymptotically. Similarly,
(ii) follows by Theorem 14.6.1. ■


Of course, the theorem has implications for testing $P = P_0$ against alternatives
outside the parametric model (14.35). Indeed, since the right side of (14.75) does
not depend on $k$, we may take $k = \infty$ on the left side and obtain the same
result. That is, the infinite sum of squared deviations of power from $\alpha$ remains
bounded. We have stated the result first for finite $k$ since our proof then only
requires convergence to a normal experiment in a finite dimensional space (as we
have not considered infinite dimensional spaces).

In fact, Janssen (2000a) shows that this result holds for each n as well; that is, 
one can simply delete the limsup in (14.75). Thus, the power of any test sequence 
is essentially flat outside a space of dimension d, where d does not depend on n. 

To explain the result a little further, fix $\theta \in \mathbb{R}^k$ and consider the
one-dimensional model indexed by $t$ with density $p_{t\theta}$ defined in (14.35). If we know
that the actual distribution belongs to this one-dimensional exponential family
submodel for some $t > 0$, then a UMP level $\alpha$ test sequence exists for testing
$t = 0$ against $t > 0$, which we now denote by $\phi^*_\theta = \{\phi^*_{n,\theta}\}$; moreover,

$$\lim_n E_{t\theta n^{-1/2}}(\phi^*_{n,\theta}) = 1 - \Phi(z_{1-\alpha} - t|\theta|) \tag{14.77}$$

(Problem 14.42). We will now connect the performance of an arbitrary test sequence
$\phi = \{\phi_n\}$ with the notion of asymptotic relative efficiency, as developed
in Section 13.2. Let $N_\phi(t, \theta, \alpha, \beta)$ be the smallest sample size required to achieve
power at least $\beta$ if the true density is $p_{t\theta}$. In the case of $\phi^*_\theta$, it follows from (14.77)
(or Theorem 13.2.1 (iii)) that, if $|\theta| = 1$,

$$\lim_{t \to 0^+} t^2 N_{\phi^*_\theta}(t, \theta, \alpha, \beta) = (z_\alpha - z_\beta)^2 . \tag{14.78}$$

With $\alpha$ and $\beta$ fixed, choose any small $\delta > 0$, any $\epsilon$ satisfying $0 < \epsilon < \beta - \alpha$ and
$H > 0$ large enough so that $(z_\alpha - z_\beta)^2/H^2 < \delta$. For an arbitrary test $\phi$, Theorem
14.6.2(ii) implies that there exists $V \subset \mathbb{R}^k$ of dimension $d$ satisfying (14.73) such
that, for all small $t$ and $\theta \in V^\perp$ with $|\theta| = 1$, the power function at $t\theta$ is bounded
above by $\alpha + \epsilon < \beta$, at least for $t$ such that $tn^{1/2} \le H$. This in turn implies that
$n$ must satisfy $n^{1/2}t > H$ in order to achieve power $\beta$; thus,

$$\liminf_{t \to 0^+} t^2 N_\phi(t, \theta, \alpha, \beta) \ge H^2 . \tag{14.79}$$

Combining (14.78) and (14.79) yields, for $\theta \in V^\perp$,

$$\limsup_{t \to 0^+} \frac{N_{\phi^*_\theta}(t, \theta, \alpha, \beta)}{N_\phi(t, \theta, \alpha, \beta)} \le \frac{(z_\alpha - z_\beta)^2}{H^2} < \delta . \tag{14.80}$$


If the limsup on the left side of (14.80) is replaced by a limit, and this limit is shown to
exist, the limiting value would be the Pitman ARE of $\phi$ with respect to $\phi^*_\theta$ for the
submodel $p_{t\theta}$. While we are not claiming such a limit exists, the interpretation of
the result is the following. Except on a set of $\theta$ values of dimension $d$ (independent
of $n$ and $k$), the test $\phi^*_\theta$ requires approximately no more than a small proportion $\delta$
of the sample size required by $\phi$ to achieve power $\beta$. Therefore, it is not possible
to simultaneously have high power along all "directions" $\theta$, at least from this
local point of view.
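
The quantities in (14.77), (14.78) and (14.80) are easy to evaluate numerically. The sketch below is only an illustration with hypothetical function names; it takes $z_p$ to be the $p$ quantile of the standard normal distribution, which is an assumption about the notation, and it uses arbitrary values of $\alpha$, $\beta$ and $\delta$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def ump_limiting_power(t: float, alpha: float, theta_norm: float = 1.0) -> float:
    """Right side of (14.77): 1 - Phi(z_{1-alpha} - t |theta|)."""
    return 1.0 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - t * theta_norm)


def scaled_sample_size(alpha: float, beta: float) -> float:
    """Right side of (14.78): (z_alpha - z_beta)^2, with z_p the p quantile."""
    return (STD_NORMAL.inv_cdf(alpha) - STD_NORMAL.inv_cdf(beta)) ** 2


def efficiency_bound(alpha: float, beta: float, H: float) -> float:
    """Middle term of (14.80): (z_alpha - z_beta)^2 / H^2."""
    return scaled_sample_size(alpha, beta) / (H * H)


def threshold_H(alpha: float, beta: float, delta: float) -> float:
    """Any H strictly larger than this value makes efficiency_bound < delta."""
    return sqrt(scaled_sample_size(alpha, beta) / delta)


if __name__ == "__main__":
    c = scaled_sample_size(0.05, 0.8)
    print(c)                                  # limit of t^2 N for the UMP test
    print(ump_limiting_power(sqrt(c), 0.05))  # local parameter attaining power 0.8
    print(threshold_H(0.05, 0.8, 0.01))       # H needed for delta = 0.01
```

For $\alpha = 0.05$ and $\beta = 0.8$ the constant in (14.78) is about 6.18, so a proportion $\delta = 0.01$ requires $H$ of roughly 25; the dimension bound (14.73) for such an $H$ is astronomically large, which is why the result is best read qualitatively.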

The possibility of high power for parameter values far from 0 (corresponding
to $|t| > H$) remains, however, and so this result does not contradict the uniform
consistency result, Theorem 14.2.2, of the Kolmogorov-Smirnov test; there, the
power tends to one against nonlocal alternatives. But, for testing goodness of fit
against a broad nonparametric class of alternatives, Lemma 14.3.1 and Theorem
14.6.1 imply that any test (sequence) performs well locally only in some fixed
finite dimensional subset of alternatives, even as $n$ increases. To put it another
way, any test has a preferred set of alternatives (of bounded dimension) for which
its power is locally high. Unfortunately, it may be difficult to analyze the preferred
alternatives for any particular test. For certain classes of tests, such as the
integral tests of Cramér-von Mises or Anderson and Darling, there exist principal
component decompositions of the test statistics, which lead to useful power
calculations; see Shorack and Wellner (1986), Chapter 5. For the Kolmogorov-Smirnov
test, it is known that it is, roughly speaking, more powerful against deviations
of the median; see Milbrodt and Strasser (1990) and Janssen (1995) for a more
careful statement. Since any given test sequence can only perform well for some
finite dimensional set of alternatives, it seems natural to design tests that perform
well on a given finite dimensional set, which is exactly the approach taken
in the construction of Neyman's smooth tests. A general theory of efficiency of
goodness of fit tests is developed in Nikitin (1995), who also compares distinct
notions of efficiency; also see Janssen (2003). Unfortunately, different efficiency
notions give rise to different tests. It appears that a proper choice of test must
be based on some knowledge of the possible set of alternatives for a given experiment.
By restricting attention to families of densities with different degrees of
smoothness, asymptotically maximin results have been obtained; see Ingster and
Suslina (2003).


14.7 Problems 

Section 14.2

Problem 14.1 Verify (14.3). 


Problem 14.2 (i) Let $X_1,\ldots,X_n$ be i.i.d. real-valued random variables with
c.d.f. $F$. Consider testing $F = F_0$ against $F \ne F_0$ based on the Kolmogorov-Smirnov
test. Fix $F$ with $n^{1/2}d_K(F, F_0) > s_{n,1-\alpha}$. Show that

$$P_F\{T_n > s_{n,1-\alpha}\} \ge 1 - \frac{1}{4|n^{1/2}d_K(F, F_0) - s_{n,1-\alpha}|^2} .$$

Hint: Use (14.6) and Chebyshev's inequality.

(ii) Derive the alternative lower bound to the power of the Kolmogorov-Smirnov
test given by (14.8). Compare the two lower bounds.

Problem 14.3 For testing $F = F_0$, where $F_0$ is the uniform (0,1) c.d.f., consider
alternatives $F_n$ to $F_0$ of the form

$$F_n(t) = (1 - \lambda_n)F_0(t) + \lambda_nG(t) ,$$

where $G \ne F_0$ is some fixed distribution. Show that, if $\lambda_n = \lambda n^{-1/2}$, then the
limiting power of the Kolmogorov-Smirnov test is bounded away from $\alpha$ if $\lambda$ is
large enough.

Problem 14.4 Suppose $F_n$ satisfies $n^{1/2}d_K(F_n, F_0) \to 0$. For testing $F = F_0$ at
level $\alpha$, show that the limiting power of the Kolmogorov-Smirnov test against $F_n$
is no better than $\alpha$. In the case that both $F_n$ and $F_0$ are continuous, show that
the limiting power is equal to $\alpha$.


Problem 14.5 (i) Suppose $\{P_\theta\}$ is q.m.d. at $\theta_0$, where $P_\theta$ is a probability distribution
on $\mathbb{R}$ with corresponding c.d.f. $F_\theta$. Show that there exists $B = B_{\theta_0}(h) < \infty$
such that

$$\limsup_n\, n\,d_K^2(F_{\theta_0 + hn^{-1/2}}, F_{\theta_0}) \le B_{\theta_0}(h)$$

and $B_{\theta_0}(h) \to 0$ as $h \to 0$.

(ii) Construct a sequence of probability distributions $P_n$ on the real line with
corresponding c.d.f.s $F_n$ satisfying $d_K(F_n, F_0) \to 0$ but $H(P_n, P_0)$ is bounded
away from 0, where $H$ is the Hellinger metric. On the other hand, show that
$H(P_n, P_0) \to 0$ implies $d_K(F_n, F_0) \to 0$.


Problem 14.6 Let $F_0$ be the uniform (0,1) c.d.f. and consider testing $F = F_0$
by the Kolmogorov-Smirnov test.

(i) Construct a sequence of alternatives $F_n$ to $F_0$ satisfying $n^{1/2}d_K(F_n, F_0) \to \delta$
with $0 < \delta < \infty$ such that the limiting power against $F_n$ is $\alpha$, even though there
exist tests whose limiting power against $F_n$ exceeds $\alpha$.

(ii) Construct a sequence of alternatives $F_n$ to $F_0$ satisfying $n^{1/2}d_K(F_n, F_0) \to \delta$
with $0 < \delta < \infty$ such that the limiting power against $F_n$ is one.

[Hint: Fix $1 > \gamma_n > 0$ with $n^{1/2}\gamma_n \to \delta > 0$ and let $F_n(t)$ be defined by

$$F_n(t) = \begin{cases} 0 & \text{if } t < \gamma_n \\ t & \text{if } \gamma_n \le t \le 1 . \end{cases} \tag{14.81}$$

Note that $d_K(F_n, F_0) = \gamma_n$ by construction. Let $U_1,\ldots,U_n$ be i.i.d. according to
the uniform distribution on (0,1), and let $\hat G_n(t)$ denote the empirical c.d.f. of the
$U_i$. Set

$$X_i = \begin{cases} U_i & \text{if } U_i \ge \gamma_n \\ \gamma_n & \text{if } U_i < \gamma_n , \end{cases} \tag{14.82}$$

so that $X_1,\ldots,X_n$ are i.i.d. with c.d.f. $F_n$. Let $\hat F_n(t)$ denote the empirical c.d.f.
of the $X_i$. Argue that

$$\sup_t |\hat F_n(t) - t| \le \max\Bigl[\sup_t |\hat G_n(t) - t| ,\ \gamma_n\Bigr]$$

and

$$P_{F_n}\{T_n > s_{n,1-\alpha}\} \le P\{n^{1/2}\sup_t |\hat G_n(t) - t| > s_{n,1-\alpha}\}$$

if $n^{1/2}\gamma_n \le s_{n,1-\alpha}$. If $\delta < s_{1-\alpha}$, then this last condition will be satisfied for large
enough $n$. Finally, the last displayed expression equals $\alpha$.]

Problem 14.7 Let $\mathcal{F}$ be the family of distributions having density $F' = f$ on
(0,1) and let $F_0' = f_0$ be the uniform density. Consider testing the null hypothesis
that $F = F_0$ based on the Kolmogorov-Smirnov test. Show that, if $d_K(f, f_0)$ is
the sup distance between densities and $0 < c < 1$, then, for every $n$,

$$\inf\{P_F\{T_n > s_{n,1-\alpha}\} : F \in \mathcal{F} ,\ d_K(F', f_0) \ge c\} \le \alpha . \tag{14.83}$$

Argue that the result applies if $d_K$ is replaced by the $L_2$ distance between densities.
Hint: Consider densities of the form $f_\theta(t) = 1 + c\sin(2\pi\theta t)$. [Compare
this result with Theorem 14.2.2. Ingster and Suslina (2003) argue that alternatives
based on the sup distance between distribution functions are less natural
than metrics between densities. This problem shows it is impossible for the
Kolmogorov-Smirnov test to have power bounded away from $\alpha$ against such
alternatives. In fact, this is true for any test; see Ingster (1993) and Section
14.6. However, by restricting the family of densities to have further smoothness
properties, Ingster and Suslina (2003) have obtained positive results.]

Problem 14.8 Generalize Theorem 14.2.2 to any EDF test statistic of the form
$n^{1/2}d(\hat F_n, F_0)$, if $d$ is a metric weaker than the Kolmogorov-Smirnov metric $d_K$ in
the sense

$$d(F, G) \le C\,d_K(F, G)$$

for some constant $C$. In particular, show the result applies to the Cramér-von
Mises test.

Problem 14.9 For testing the null hypothesis that $X_1,\ldots,X_n$ are i.i.d. from a
normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$, show that
the null distribution of (14.13) does not depend on $(\mu, \sigma)$ (but it does depend on
$n$). Describe a simulation method to approximate this null distribution. How can
you construct a test that is exact level $\alpha = 0.05$ based on simulation? Generalize
this problem to testing a general location-scale family.

Problem 14.10 Suppose $X_1,\ldots,X_n$ are i.i.d. with c.d.f. $F$ on the real line. The
problem is to test the null hypothesis $H_0$ that the $X_i$ are uniform on $(0, \theta]$ for
some $\theta$. Let $\hat\theta_n = \max(X_1,\ldots,X_n)$, and let $\hat F_n$ be the empirical distribution
function. Let $d_K(F, G)$ be the Kolmogorov-Smirnov distance between $F$ and $G$.
Consider the test statistic

$$T_n = n^{1/2}d_K(\hat F_n, F_{\hat\theta_n}) ,$$

where $F_\theta$ is the uniform $(0, \theta)$ c.d.f. Under $H_0$, what is the limiting distribution
of $T_n$?

Problem 14.11 Let $X_1,\ldots,X_n$ be a sample from the normal distribution with
mean $\theta$ and variance 1, with cdf denoted by $F_\theta(\cdot)$. Let $\Phi(\cdot)$ denote the standard
normal cdf, so that $F_\theta(t) = \Phi(t - \theta)$. For any two cdfs $F$ and $G$, let $\|F - G\|$
denote $\sup_t |F(t) - G(t)|$. Let $\hat\theta_n$ be the estimator of $\theta$ minimizing $\|\hat F_n - F_\theta\|$,
where $\hat F_n(t) = n^{-1}\sum_{i=1}^n 1(X_i \le t)$ denotes the empirical cdf. In case you are
worried about problems of existence or uniqueness, you may assume $\hat\theta_n$ is any
estimator satisfying

$$\|\hat F_n - F_{\hat\theta_n}\| \le \inf_\theta \|\hat F_n - F_\theta\| + \epsilon_n ,$$

where $\epsilon_n$ is any sequence of positive constants tending to 0.

(i) Prove $\hat\theta_n$ is a consistent estimator of $\theta$.

(ii) Suppose now the observations come from a cdf $F$, possibly nonnormal. The
problem is to test the null hypothesis that $F$ is normal with variance 1 against
the alternative hypothesis that $F$ is not. Consider the test statistic

$$T_n = \inf_\theta \|\hat F_n - F_\theta\| .$$

Argue, if $F$ is $N(\theta, 1)$, then the distribution of $T_n$ does not depend on $\theta$.

(iii) If $F$ is not normal with variance one, argue that $T_n$ tends in probability to
the constant $\gamma_F = \inf_\theta \|F - F_\theta\|$, and $\gamma_F > 0$.

(iv) Find a sequence of constants $c_n$ so that the test that rejects iff $T_n > c_n$ has
probability of a Type I error tending to 0, and has power tending to one for any
fixed alternative $F$. Hint: Use the Dvoretzky, Kiefer, Wolfowitz Inequality.


Section 14.3

Problem 14.12 (i) Verify (14.19). 

(ii) Verify (14.20). 

Problem 14.13 Prove the convergence (14.21). 


Problem 14.14 In the multinomial goodness of fit problem, calculate the 
Information matrix I(p) given by (14.22). 

Problem 14.15 Prove part (iii) of Theorem 14.3.1. 

Problem 14.16 Show that the result Theorem 14.3.2 (ii) holds for the likelihood 
ratio test. 

Problem 14.17 Prove Lemma 14.3.1 (i). 

Problem 14.18 Recall $M(k, h)$ defined by (14.27) and let $F_k$ denote the c.d.f.
of the central Chi-squared distribution with $k$ degrees of freedom. Show that

$$M(k, h) = \alpha + \gamma_k\frac{h^2}{2} + o(h^2) \quad \text{as } h \to 0 ,$$

where

$$\gamma_k = F_k(c_{k,1-\alpha}) - F_{k+2}(c_{k,1-\alpha}) .$$

Problem 14.19 As in Section 14.3.2, consider the Chi-squared test for testing
uniformity on (0,1) based on $k + 1$ cells; call it $\phi_{n,k}$. Fix any $B < \infty$ and $\epsilon > 0$.
Let $U_B$ be the set of $u$ with $\int u = 0$ and $\int u^2 \le B$. For alternative sequences of
the form (14.25) with $b_n = n^{-1/2}$, show that, if $k$ is large enough (but fixed),
then

$$\limsup_n \sup_{u \in U_B} E_{f_n}(\phi_{n,k}) \le \alpha + \epsilon .$$


Problem 14.20 Verify (14.33). 

Problem 14.21 Under the setup of Problem 12.61, determine a Chi-squared 
test statistic, as well as its limiting distribution under the null hypothesis. [For 
a discussion of the Chi-squared test for testing independence in a two-way table, 
see Diaconis and Efron (1985) and Loh (1989).] 

Problem 14.22 The Hardy-Weinberg law says the following. If gene frequencies
are in equilibrium, the genotypes AA, Aa, and aa occur in a population with frequencies
$\theta^2$, $2\theta(1 - \theta)$, and $(1 - \theta)^2$. In an i.i.d. sample of size $n$, with each outcome
being an AA, Aa, or aa with the above probabilities, let $X_1$, $X_2$, and $X_3$ be the
observed counts. For example, $X_1$ is the number of trials where the observation
is AA. Note that $X_1 + X_2 + X_3 = n$. The joint distribution of $(X_1, X_2, X_3)$ is a
trinomial distribution. Hence,

$$P_\theta\{X_1 = x_1, X_2 = x_2, X_3 = x_3\} = \frac{n!}{x_1!\,x_2!\,x_3!}(\theta^2)^{x_1}[2\theta(1 - \theta)]^{x_2}[(1 - \theta)^2]^{x_3}$$

for any nonnegative integers $x_1$, $x_2$, and $x_3$ summing to $n$. Find the MLE and
its limiting distribution (suitably normalized). Derive the likelihood ratio and
chi-squared tests to test the Hardy-Weinberg law.

Problem 14.23 In Example 14.3.1, verify (14.32) and determine the MLE $\hat\beta_n$
for the linkage submodel being tested. Determine the limiting distribution of the
Chi-squared statistic $Q_n(\hat\beta_n)$.

Problem 14.24 Consider the limit distribution of the Chi-squared goodness-of-fit
statistic for testing normality if using the maximum likelihood estimators to
estimate the unknown parameters. Specifically, suppose $X_1,\ldots,X_n$ are i.i.d. and
the problem is to test whether the underlying distribution is $N(\theta, 1)$ for some
$\theta$. Group the observations into just 2 groups: positive observations and negative
observations. Derive the limit distribution of the Chi-squared statistic using the
sample mean to estimate $\theta$ and show it is not Chi-squared.


Section 14.4

Problem 14.25 Let $X_1,\ldots,X_n$ be i.i.d. $F$, and consider testing the null hypothesis
that $F$ is the uniform (0,1) c.d.f. For $\theta = (\theta_1, \theta_2) \in \mathbb{R}^2$, consider a
family of alternative densities of the form

$$p_\theta(x) = C(\theta)\exp[\theta_1T_1(x) + \theta_2T_2(x)] , \quad 0 < x < 1 .$$

Assume this two-parameter exponential family is well-defined for all small enough
$|\theta|$, so that the family is a full rank exponential family which is q.m.d. at $\theta = 0$
with Information matrix at $\theta = 0$ denoted by $I(0)$. For the submodel with $\theta_2 = 0$,
what is the optimal limiting power for testing $\theta_1 = 0$ against $\theta_1 = hn^{-1/2}$ at level
$\alpha$? Similarly, with $\theta_1 = 0$, what is the optimal limiting power for testing $\theta_2 = 0$
against $\theta_2 = hn^{-1/2}$? Prove that no level $\alpha$ test sequence exists whose limiting
power simultaneously achieves these optimal values. Hint: If $(Z_1, Z_2)$ is bivariate
normal with mean vector $(h_1, h_2)$, then no UMP test exists for testing $(h_1, h_2) = (0, 0)$.

Problem 14.26 In Example 14.4.3, show that the multinomial distribution can
be written in the form (14.35) for the given orthogonal choice of functions $T_j$.

Problem 14.27 Show that (14.42) holds with $B = \infty$ if $\mathrm{Var}_\theta[T_j(X_1)]$ is uniformly
bounded in $\theta$. Hint: Argue by contradiction. Suppose there exists $h_n$ with
$|h_n| \ge b$ such that

$$E_{h_nn^{-1/2}}(\phi^*_n) \to \ell ,$$

where $\ell$ is less than the right side of (14.42). This is a contradiction if

$$E_{h_nn^{-1/2}}(\phi^*_n) \to 1$$

if $|h_n| \to \infty$. By taking subsequences if necessary, assume the $j$th component
$h_{n,j}$ of $h_n$ satisfies $|h_{n,j}| \to \infty$. Then,

$$E_{h_nn^{-1/2}}(\phi^*_n) \ge P_{h_nn^{-1/2}}\{Z_{n,j}^2 > c_{k,1-\alpha}\} .$$

It now suffices to show $|Z_{n,j}| \to \infty$ in probability under $h_nn^{-1/2}$. But
$|E_\theta[T_j(X_i)]|$ increases in $\theta$ (using properties of exponential families) while the
variance of $Z_{n,j}$ remains bounded.

Problem 14.28 For testing $P = P_0$ in the model of densities (14.35) with $T_j$ the
normalized Legendre polynomials, show that Neyman's smooth test is consistent
in power against any distribution $P$ as long as the first $k$ moments of $P$ are not
all identical to the first $k$ moments of $P_0$.

Problem 14.29 Let $X_1,\ldots,X_n$ be i.i.d. random variables on [0,1] with unknown
distribution $P$. The problem is to test $P = P_0$, the uniform distribution
on [0,1]. Assume a parametric model with densities of the form (14.35) for some
fixed positive integer $k$. Set $T_0(x) = 1$ and assume the functions $T_1,\ldots,T_k$ are
chosen so that $T_0,\ldots,T_k$ is a set of orthonormal functions on $L_2(P_0)$. Assume
that

$$\sup_x |T_j(x)| < \infty ,$$

so that $C_k(\theta)$ is well-defined for all $k$-vectors $\theta$. Let $\Lambda_n$ be a probability distribution
over values of $\theta$ and let $A(\phi_n, \Lambda_n)$ denote the average power of a test $\phi_n$
with respect to $\Lambda_n$; that is,

$$A(\phi_n, \Lambda_n) = \int_\theta E_\theta(\phi_n)\,d\Lambda_n(\theta) .$$

In particular, let $\Lambda_n$ be the $k$-dimensional normal distribution with mean vector
0 and covariance matrix equal to $n^{-1}$ times the identity matrix. Among tests $\phi_n$
such that $E_0(\phi_n) \to \alpha$, find one that maximizes

$$\lim_n A(\phi_n, \Lambda_n)$$

and find a simple expression for this limiting average power.

Problem 14.30 Use Minkowski’s Inequality (Section A.3) to show (14.47). 
Problem 14.31 Show (14.49). 

Problem 14.32 Argue the validity of (14.51). 


Section 14.5

Problem 14.33 In Theorem 14.5.1, show that $W$ has a continuous, strictly
increasing distribution function on $(0, \infty)$. Hint: Write $W = a_iZ_i^2 + R$ for some
$i$ with $a_i > 0$ and note that $a_iZ_i^2$ has a density.


Problem 14.34 Show that the distribution of the Cramér-von Mises test
statistic (14.57) under $F_0$ is the same for all continuous distributions $F_0$.


Problem 14.35 Show that the Cramér-von Mises test statistic $C_n$ given by
(14.57) can be computed by

$$C_n = \frac{1}{12n} + \sum_{i=1}^n \Bigl[F_0(X_{(i)}) - \frac{2i - 1}{2n}\Bigr]^2 ,$$

where $X_{(1)} \le \cdots \le X_{(n)}$ denote the order statistics; see D'Agostino and Stephens
(1986), p. 101 for computing formulas for other test statistics based on the
empirical distribution function.
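
The computing formula of Problem 14.35 can be checked numerically. The sketch below assumes that (14.57) is the usual definition $C_n = n\int [\hat F_n(x) - F_0(x)]^2\,dF_0(x)$, and it compares the order-statistic formula with an exact piecewise evaluation of that integral when $F_0$ is the uniform (0,1) c.d.f. The function names are hypothetical.

```python
from typing import Callable, Sequence


def cramer_von_mises(sample: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Order-statistic formula: 1/(12n) + sum_i [F0(X_(i)) - (2i - 1)/(2n)]^2."""
    n = len(sample)
    u = sorted(cdf(x) for x in sample)  # F0 is monotone, so this orders F0(X_(i))
    return 1.0 / (12 * n) + sum(
        (u_i - (2 * i - 1) / (2 * n)) ** 2 for i, u_i in enumerate(u, start=1)
    )


def cramer_von_mises_uniform_integral(sample: Sequence[float]) -> float:
    """n * integral over (0,1) of (F_n(t) - t)^2 dt, for data in (0,1).

    The empirical c.d.f. equals i/n between the i-th and (i+1)-th order
    statistics, so each piece integrates in closed form.
    """
    n = len(sample)
    knots = [0.0] + sorted(sample) + [1.0]
    total = 0.0
    for i in range(n + 1):
        a, b = knots[i], knots[i + 1]
        level = i / n
        total += ((b - level) ** 3 - (a - level) ** 3) / 3.0
    return n * total


if __name__ == "__main__":
    data = [0.1, 0.4, 0.7]
    print(cramer_von_mises(data, lambda x: x))
    print(cramer_von_mises_uniform_integral(data))
```

The two functions agree up to rounding on any sample in (0,1), which is the identity the problem asks to prove in the uniform case.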


Problem 14.36 Let $F$ be a c.d.f. on (0,1). If

$$\int_0^1 \cos(\pi jx)\,dF(x) = 0$$

for all $j = 1, 2, \ldots$, then $F$ must be the uniform distribution on (0,1). Hint:
Integrate by parts and use the fact the functions $\sqrt{2}\sin(\pi jx)$ form a complete,
orthonormal system for $L_2[0, 1]$.


Problem 14.37 Show that the Anderson-Darling statistic (14.58) can be 
rewritten in the form (14.59). 

Problem 14.38 Consider $W_n$ with $T_j(x) = \sqrt{2}\cos(\pi jx)$. Fix $\gamma_j > 0$ with $\sum_j \gamma_j^2 <
\infty$. Let

$$q_\theta(x) = C(\theta)\exp\Bigl[\theta\sum_{j=1}^\infty \gamma_jT_j(x)\Bigr] .$$

Show that, under $\theta = hn^{-1/2}$,

3 


Problem 14.39 Verify the claims made in Example 14.5.4.

Problem 14.40 What is the characteristic function of the limiting random variable
$W$ of Theorem 14.5.1? As a special case, show that the characteristic function
of the limiting null distribution of the Cramér-von Mises statistic is given by

$$c(t) = \prod_{j=1}^\infty \Bigl(1 - \frac{2it}{j^2\pi^2}\Bigr)^{-1/2} .$$

(Note this characteristic function was inverted by Smirnov; see Durbin (1973),
p. 32.)

Problem 14.41 Show that the expression (14.65) exceeds $\alpha$ if there exists a $j$
for which $a_j > 0$ and $c_j \ne 0$. Also, show that (14.65) is an increasing function of
$|c_j|$.

Section 14.6

Problem 14.42 Show why (14.77) is true. 

Problem 14.43 Consider the setting of Problem 8.30 with $\delta = \delta_k \to 0$ as
$k \to \infty$. At what rate should $\delta_k \to 0$ as $k \to \infty$ so that the limiting maximin
power is strictly between $\alpha$ and 1?


14.8 Notes 

Goodness of fit tests based on the empirical distribution function were introduced
by Cramér (1928), von Mises (1931) and Kolmogorov (1933). A classical reference
for the asymptotic theory of such tests is Durbin (1973); also see Kendall
and Stuart (1979, Chapter 30), Neuhaus (1979) and Tallis (1983). Readable accounts
of many goodness of fit tests can be found in D'Agostino and Stephens
(1986) and Read and Cressie (1988). Methods particularly suitable for testing
normality are discussed for example in Shapiro, Wilk, and Chen (1968), Hegazy
and Green (1975), D'Agostino (1982), Hall and Welsh (1983), and Spiegelhalter
(1983), and for testing exponentiality in Galambos (1982), Brain and Shapiro
(1983), Spiegelhalter (1983), Deshpande (1983), Doksum and Yandell (1984), and
Spurrier (1984). See also Kent and Quesenberry (1982). Modern treatments are
provided by Shorack and Wellner (1986), van der Vaart and Wellner (1996) and
Nikitin (1995). Some recent generalizations of the Kolmogorov-Smirnov test for
testing goodness of fit are discussed in Beran and Millar (1986, 1988), Romano
(1988), Khmaladze (1993), Cabaña and Cabaña (1997), Dümbgen (1998), and
Polonik (1999).

The Chi-squared test was introduced by Pearson (1900). Cohen and Sackrowitz
(1975) prove a finite sample local optimality property of the Chi-squared test in
the case of testing a simple null hypothesis of equal cell probabilities. In the
context of testing a multinomial, Hoeffding (1965) compares the Chi-squared
and likelihood ratio tests while letting $\alpha \to 0$ as $n \to \infty$; he finds the likelihood
ratio test superior if the number of cells is fixed, but notes the situation can
be reversed otherwise. As mentioned in Section 14.3, the use of the Chi-squared
test for testing goodness of fit for continuous observations is hampered by the
apparent loss of information through data grouping and the choice of the number
of groups. The choice of the number of groups is considered, among others, by
Quine and Robinson (1985) and by Kallenberg, Oosterhoff, and Schriever (1985).
A class of generalized Chi-squared tests is studied in Drost (1988, 1989), who uses
the concept of Pitman asymptotic relative efficiency to study the effect of number
of groups; a particular test, known as the Rao-Robson-Nikulin test, is advocated.
In the case of nuisance parameters, Fisher (1924) argued that estimating nuisance
parameters changes the limiting distribution of the Chi-squared statistic, contrary
to early opinion. Chernoff and Lehmann (1954) showed that, when parameters
are estimated by MLEs, the limiting distribution need not even be Chi-squared;
also see de Wet and Randles (1987). For further discussion on the Chi-squared
test, as well as its generalizations, see Kendall and Stuart (1979). A full account
of the practical implementation of the Chi-squared test, including the accuracy
of the Chi-squared approximation and choice of classes, as well as an extensive
bibliography, is provided by Greenwood and Nikulin (1996).

Neyman's smooth tests were introduced in Neyman (1937b), and were later seen
to be a special case of the general score tests of Rao (1947). An elementary
treatment is provided by Rayner and Best (1989), who also consider extensions
to problems with nuisance parameters. The use of smooth tests for multinomial
data with adaptive choice of order is advocated in Eubank (1997). For recent work
on smooth tests for composite hypotheses, see Inglot, Kallenberg and Ledwina
(1997), Peña (1998), and Fan and Lin (1998).

Goodness of fit tests based on the Kullback-Leibler divergence are studied in 
Barron (1989). Tests based on spacings are considered in Wells, Jammalamadaka 
and Tiwari (1993). Tests based on the likelihood ratio are given in Zhang (2002).


15.1 Introduction

In this chapter, we shall deal with situations where both the hypothesis and the
class of alternatives may be nonparametric and where as a result it may be difficult
even to construct tests (or confidence regions) that satisfactorily control the
level (exactly or asymptotically). For such situations, we shall develop methods
which achieve this modest goal under fairly general assumptions. A secondary
aim will then be to obtain some idea of the power of the resulting tests.

In Section 15.2, we consider the class of randomization tests as a generalization 
of permutation tests. Under the randomization hypothesis (see Definition 15.2.1 
below), the empirical distribution of the values of a given statistic recomputed 
over transformations of the data serves as a null distribution; this leads to exact 
control of the level in such models. When the randomization hypothesis holds, 
the construction applies broadly to any statistic. Efficiency properties ensue if 
the statistic is chosen appropriately. 

In Section 15.3 we review some basic constructions of confidence regions and 
tests, which derive from the limiting distribution of an estimator or test sequence. 
This serves to motivate the bootstrap construction studied in Section 15.4; the 
bootstrap method offers a powerful approach to approximating the sampling 
distribution of a given statistic or estimator. The emphasis here is to find methods 
that control the level constraint, at least asymptotically. Like the randomization 
construction, the bootstrap approach will be asymptotically efficient if the given 
statistic is chosen appropriately; for example, see Theorem 15.4.2 and Corollary 
15.4.1. 

While the bootstrap is quite general, how does it compare in situations when
other large sample approaches apply as well? In Section 15.5, we provide some
support to the claim that the bootstrap approach can improve upon methods
which rely on a normal approximation. The use of the bootstrap in the context
of hypothesis testing is studied in Section 15.6.

While the bootstrap method is quite broadly applicable, in some situations, it
can be inconsistent. A more general approach based on subsampling is presented
in Section 15.7. Together, these approaches serve as valuable tools for inference without
having to make strong assumptions about the underlying distribution.


15.2 Permutation and Randomization Tests 

Permutation tests were introduced in Chapter 5 as a robust means of controlling
the level of a test if the underlying parametric model only holds approximately.
For example, the two-sample permutation $t$-test for testing equality of means
studied in Section 11 of Chapter 5 has level $\alpha$ whenever the two populations
have the same distribution under the null hypothesis (without the assumption of
normality). In this section, we consider the large sample behavior of permutation
tests and, more generally, randomization tests. The use of the term randomization
here is distinct from its meaning in Section 5.10. There, randomization was
used as a device prior to collecting data, for example, by randomly assigning
experimental units to treatment or control. Such a device allows for a meaningful
comparison after the data has been observed, by considering the behavior
of a statistic recomputed over permutations in the data. Thus, the term randomization
referred to both the experimental design and the analysis of data by
recomputing a statistic over permutations or randomizations (sometimes called
rerandomizations) of the data. It is this latter use of randomization that we now
generalize. Thus, the term randomization test will refer to tests obtained by
recomputing a test statistic over transformations (not necessarily permutations) of
the data.

A general test construction will be presented that yields an exact level $\alpha$ test for
a fixed sample size, under a certain group invariance hypothesis. Then, two main
questions will be addressed. First, we shall consider the robustness of the level. For
example, in the two-sample problem just mentioned, the underlying populations
may have the same mean under the null hypothesis, but differ in other ways,
as in the classical Behrens-Fisher problem, where the underlying populations
are normal but may not have the same variance. Then, the rejection probability
under such populations is no longer $\alpha$, and it becomes necessary to investigate
the behavior of the rejection probability. In addition, we also consider the large
sample power of permutation and randomization tests. In the two-sample problem
when the underlying populations are normal with common variance, for example,
we should like to know whether there is a significant loss in power when using a
permutation test as compared to the UMPU t-test.


15.2.1 The Basic Construction 

Based on data $X$ taking values in a sample space $\mathcal{X}$, it is desired to test the null
hypothesis $H$ that the underlying probability law $P$ generating $X$ belongs to a
certain family $\Omega_0$ of distributions. Let $\mathbf{G}$ be a finite group of transformations $g$
of $\mathcal{X}$ onto itself. The following assumption, which we will call the randomization
hypothesis, allows for a general test construction.


Definition 15.2.1 (Randomization Hypothesis) Under the null hypothesis,
the distribution of $X$ is invariant under the transformations in $\mathbf{G}$; that is, for
every $g$ in $\mathbf{G}$, $gX$ and $X$ have the same distribution whenever $X$ has distribution
$P$ in $\Omega_0$.


The randomization hypothesis asserts that the null hypothesis parameter space
$\Omega_0$ remains invariant under $g$ in $\mathbf{G}$. However, here we specifically do not require
the alternative hypothesis parameter space to remain invariant (unlike what was
assumed in Chapter 6).

As an example, consider testing the equality of distributions based on two independent
samples $(Y_1, \ldots, Y_m)$ and $(Z_1, \ldots, Z_n)$, which was previously considered
in Sections 5.8-5.11. Under the null hypothesis that the samples are generated
from the same probability law, the observations can be permuted or assigned at
random to either of the two groups, and the distribution of the permuted samples
is the same as the distribution of the original samples. (Note that a test that is
invariant with respect to all permutations of the data would be useless here.)

To describe the general construction of a randomization test, let $T(X)$ be any
real-valued test statistic for testing $H$. Suppose the group $\mathbf{G}$ has $M$ elements.
Given $X = x$, let

$$T^{(1)}(x) \le T^{(2)}(x) \le \cdots \le T^{(M)}(x)$$

be the ordered values of $T(gx)$ as $g$ varies in $\mathbf{G}$. Fix a nominal level $\alpha$, $0 < \alpha < 1$,
and let $k$ be defined by

$$k = M - [M\alpha] , \tag{15.1}$$

where $[M\alpha]$ denotes the largest integer less than or equal to $M\alpha$. Let $M^+(x)$ and
$M^0(x)$ be the number of values $T^{(j)}(x)$ ($j = 1, \ldots, M$) which are greater than
$T^{(k)}(x)$ and equal to $T^{(k)}(x)$, respectively. Set

$$a(x) = \frac{M\alpha - M^+(x)}{M^0(x)} .$$

Generalizing the construction presented in Section 5.8, define the randomization
test function $\phi(x)$ to be equal to 1, $a(x)$, or 0 according to whether
$T(x) > T^{(k)}(x)$, $T(x) = T^{(k)}(x)$, or $T(x) < T^{(k)}(x)$, respectively. By construction, for
every $x$ in $\mathcal{X}$,

$$\sum_{g \in \mathbf{G}} \phi(gx) = M^+(x) + a(x) M^0(x) = M\alpha . \tag{15.2}$$


The following theorem shows that the resulting test is level $\alpha$, under the hypothesis
that $X$ and $gX$ have the same distribution whenever the distribution of
$X$ is in $\Omega_0$. Note that this result is true for any choice of test statistic $T$.


Theorem 15.2.1 Suppose $X$ has distribution $P$ on $\mathcal{X}$ and the problem is to test
the null hypothesis $P \in \Omega_0$. Let $\mathbf{G}$ be a finite group of transformations of $\mathcal{X}$
onto itself. Suppose the randomization hypothesis holds, so that, for every $g \in \mathbf{G}$,
$X$ and $gX$ have the same distribution whenever $X$ has a distribution $P$ in $\Omega_0$.
Given a test statistic $T = T(X)$, let $\phi$ be the randomization test as described
above. Then,

$$E_P[\phi(X)] = \alpha \quad \text{for all } P \in \Omega_0 . \tag{15.3}$$

Proof. To prove (15.3), by (15.2),

$$M\alpha = E_P\Big[\sum_{g} \phi(gX)\Big] = \sum_{g} E_P[\phi(gX)] .$$

By hypothesis $E_P[\phi(gX)] = E_P[\phi(X)]$, so that

$$M\alpha = \sum_{g} E_P[\phi(X)] = M E_P[\phi(X)] ,$$

and the result follows. ■
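
The construction in (15.1) and (15.2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the group is taken to be the sign-change group of Example 15.2.1, the statistic is the sum of the observations, and the names `sign_change_group` and `randomization_test` are ours and do not come from the text. The last lines check the identity (15.2) numerically on a small data set.

```python
import itertools
import math


def sign_change_group(n):
    """All 2**n sign-change maps on lists of length n (assumed group, as in Example 15.2.1)."""
    return [
        (lambda x, s=s: [si * xi for si, xi in zip(s, x)])
        for s in itertools.product((1, -1), repeat=n)
    ]


def randomization_test(x, group, statistic, alpha):
    """Value in [0, 1] of the randomization test function at the data x."""
    values = sorted(statistic(g(x)) for g in group)  # T^(1) <= ... <= T^(M)
    M = len(values)
    k = M - math.floor(M * alpha)                    # (15.1)
    t_k = values[k - 1]                              # T^(k); the text indexes from 1
    m_plus = sum(v > t_k for v in values)            # M^+(x)
    m_zero = sum(v == t_k for v in values)           # M^0(x)
    a = (M * alpha - m_plus) / m_zero                # randomization probability a(x)
    t_obs = statistic(x)
    if t_obs > t_k:
        return 1.0
    if t_obs == t_k:
        return a
    return 0.0


# Hypothetical data, used only to check identity (15.2).
x = [1.5, -0.3, 2.0, 0.7]
group = sign_change_group(len(x))
alpha = 0.1
# Summing the test function over the orbit of x should give M * alpha.
total = sum(randomization_test(g(x), group, sum, alpha) for g in group)
```

Enumerating the whole group is feasible only for small $M$; for larger groups one would sample transformations, as discussed later in this section.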

To gain further insight as to why the construction works, for any $x \in \mathcal{X}$, let
$\mathbf{G}^x$ denote the $\mathbf{G}$-orbit of $x$; that is,

$$\mathbf{G}^x = \{gx : g \in \mathbf{G}\} .$$

Recall from Section 6.2 that these orbits partition the sample space. The hypothesis
in Theorem 15.2.1 implies that the conditional distribution of $X$ given
$X \in \mathbf{G}^x$ is uniform on $\mathbf{G}^x$, as will be seen in the next theorem. Since this
conditional distribution is the same for all $P \in \Omega_0$, a test can be constructed to be
level $\alpha$ conditionally, which is then level $\alpha$ unconditionally as well. Because the
event $\{X \in \mathbf{G}^x\}$ typically has probability zero for all $x$, we need to be careful
about how we state a result. As $x$ varies, the sets $\mathbf{G}^x$ form a partition of the
sample space. Let $\mathcal{G}$ be the $\sigma$-field generated by this partition.

Theorem 15.2.2 Under the null hypothesis of Theorem 15.2.1, for any real-valued
statistic $T = T(X)$, any $P \in \Omega_0$, and any Borel subset $B$ of the real
line,

$$P\{T(X) \in B \mid X \in \mathcal{G}\} = M^{-1} \sum_{g} I\{T(gx) \in B\} \tag{15.4}$$

with probability one under $P$. In particular, if the $M$ values of $T(gx)$ as $g$ varies
in $\mathbf{G}$ are all distinct, then the uniform distribution on these $M$ values serves as
a conditional distribution of $T(X)$ given that $X \in \mathbf{G}^x$.

Proof. First, we claim that, for any $g \in \mathbf{G}$ and $E \in \mathcal{G}$, $gE = E$. To see why,
assume $y \in E$. Then, $g^{-1}y \in E$, because $g^{-1}y$ is on the same orbit as $y$. Then,
$g g^{-1} y \in gE$ or $y \in gE$. A similar argument shows that, if $y \in gE$, then $y \in E$,
so that $gE = E$. Now, the right hand side of (15.4) is clearly $\mathcal{G}$-measurable, since
the right hand side is constant on any orbit. We need to prove, for any $E \in \mathcal{G}$,

$$\int_E M^{-1} \sum_{g} I\{T(gx) \in B\} \, dP(x) = P\{T(X) \in B, \ X \in E\} .$$

But, the left hand side is

$$M^{-1} \sum_{g} \int_E I\{T(gx) \in B\} \, dP(x) = M^{-1} \sum_{g} P\{T(gX) \in B, \ X \in E\}$$

$$= M^{-1} \sum_{g} P\{T(gX) \in B, \ gX \in gE\} = M^{-1} \sum_{g} P\{T(gX) \in B, \ gX \in E\} ,$$

since $gE = E$. Hence, this last expression becomes (by the randomization
hypothesis)

$$M^{-1} \sum_{g} P\{T(X) \in B, \ X \in E\} = P\{T(X) \in B, \ X \in E\} ,$$

as was to be shown. ■


Example 15.2.1 (One Sample Tests) Let $X = (X_1, \ldots, X_n)$, where the $X_i$
are i.i.d. real-valued random variables. Suppose that, under the null hypothesis,
the distribution of the $X_i$ is symmetric about 0. This applies, for example, to the
parametric normal location model when the null hypothesis specifies the mean
is 0, but it also applies to the nonparametric model that consists of all distributions
with the null hypothesis specifying the underlying distribution is symmetric
about 0. For $i = 1, \ldots, n$, let $\epsilon_i$ take on either the value 1 or $-1$. Consider a
transformation $g = (\epsilon_1, \ldots, \epsilon_n)$ of $\mathbb{R}^n$ that takes $x = (x_1, \ldots, x_n)$ to $(\epsilon_1 x_1, \ldots, \epsilon_n x_n)$.
Finally, let $\mathbf{G}$ be the collection of the $M = 2^n$ such transformations. Then, the
randomization hypothesis holds, i.e., $X$ and $gX$ have the same distribution under
the null hypothesis. ■


Example 15.2.2 (Two Sample Tests) Suppose $Y_1, \ldots, Y_m$ are i.i.d. observations
from a distribution $P_Y$ and, independently, $Z_1, \ldots, Z_n$ are i.i.d. observations
from a distribution $P_Z$. Here, $X = (Y_1, \ldots, Y_m, Z_1, \ldots, Z_n)$. Suppose that, under
the null hypothesis, $P_Y = P_Z$. This applies, for example, to the parametric
normal two-sample problem for testing equality of means when the populations
have a common (possibly unknown) variance. Alternatively, it also applies to
the parametric normal two-sample problem where the null hypothesis is that the
means and variances are the same, but under the alternative either the means or
the variances may differ; this model was advocated by Fisher (1935a, pp. 122-124).
Lastly, this setup also applies to the nonparametric model where $P_Y$ and $P_Z$ may
vary freely, but the null hypothesis is that $P_Y = P_Z$. To describe an appropriate
$\mathbf{G}$, let $N = m + n$. For $x = (x_1, \ldots, x_N) \in \mathbb{R}^N$, let $gx \in \mathbb{R}^N$ be defined by
$(x_{\pi(1)}, \ldots, x_{\pi(N)})$, where $(\pi(1), \ldots, \pi(N))$ is a permutation of $\{1, \ldots, N\}$. Let $\mathbf{G}$
be the collection of all such $g$, so that $M = N!$. Whenever $P_Y = P_Z$, $X$ and
$gX$ have the same distribution. In essence, each transformation $g$ produces a
new data set $gx$, of which the first $m$ elements are used as the $Y$ sample and
the remaining $n$ as the $Z$ sample to recompute the test statistic. Note that, if a
test statistic is chosen that is invariant under permutations within each of the $Y$
and $Z$ samples (which makes sense by sufficiency), it is enough to consider the
$\binom{N}{m}$ transformed data sets obtained by taking $m$ observations from all $N$ as the
$Y$ observations and the remaining $n$ as the $Z$ observations (which, of course, is
equivalent to using a subgroup $\mathbf{G}'$ of $\mathbf{G}$).

As a special case, suppose the observations are real-valued and the underlying
distribution is assumed continuous. Suppose $T$ is any statistic that is a function
of the ranks of the combined observations, so that $T$ is a rank statistic (previously
studied in Sections 6.8 and 6.9). The randomization (or permutation) distribution
can be obtained by recomputing $T$ over all permutations of the ranks. In this
sense, rank tests are special cases of permutation tests. ■


Example 15.2.3 (Tests of Independence) Suppose that $X$ consists of i.i.d.
random vectors $X = ((Y_1, Z_1), \ldots, (Y_n, Z_n))$ having common joint distribution
$P$ and marginal distributions $P_Y$ and $P_Z$. Assume, under the null hypothesis,
$Y_i$ and $Z_i$ are independent, so that $P$ is the product of $P_Y$ and $P_Z$. This
applies to the parametric bivariate normal model when testing that the correlation
is zero, but it also applies to the nonparametric model when the null
hypothesis specifies $Y_i$ and $Z_i$ are independent with arbitrary marginal distributions.
To describe an appropriate $\mathbf{G}$, let $(\pi(1), \ldots, \pi(n))$ be a permutation of
$\{1, \ldots, n\}$. Let $g$ be the transformation that takes $((y_1, z_1), \ldots, (y_n, z_n))$ to the
value $((y_1, z_{\pi(1)}), \ldots, (y_n, z_{\pi(n)}))$. Let $\mathbf{G}$ be the collection of such transformations,
so that $M = n!$. Whenever $Y_i$ and $Z_i$ are independent, $X$ and $gX$ have the
same distribution. ■


In general, one can define a p-value $\hat p$ of a randomization test by

$$\hat p = \frac{1}{M} \sum_{g} I\{T(gX) \ge T(X)\} . \tag{15.5}$$

It can be shown (Problem 15.2) that $\hat p$ satisfies, under the null hypothesis,

$$P\{\hat p \le u\} \le u \quad \text{for all } 0 \le u \le 1 . \tag{15.6}$$

Therefore, the nonrandomized test that rejects when $\hat p \le \alpha$ is level $\alpha$.

Because $\mathbf{G}$ may be large, one may resort to an approximation to construct
the randomization test, for example, by randomly sampling transformations $g$
from $\mathbf{G}$ with or without replacement. In the former case, for example, suppose
$g_1, \ldots, g_{B-1}$ are i.i.d. and uniformly distributed on $\mathbf{G}$. Let

$$\tilde p = \frac{1}{B} \Big[ 1 + \sum_{i=1}^{B-1} I\{T(g_i X) \ge T(X)\} \Big] . \tag{15.7}$$

Then, it can be shown (Problem 15.3) that, under the null hypothesis,

$$P\{\tilde p \le u\} \le u \quad \text{for all } 0 \le u \le 1 , \tag{15.8}$$

where this probability reflects variation in both $X$ and the sampling of the $g_i$.
Note that (15.8) holds for any $B$, and so the test that rejects when $\tilde p \le \alpha$ is level
$\alpha$ even when a stochastic approximation is employed. Of course, the larger the
value of $B$, the closer $\hat p$ and $\tilde p$ are to each other; in fact, $\hat p - \tilde p \to 0$ in probability
as $B \to \infty$ (Problem 15.4). Approximations based on auxiliary randomization
(such as the sampling of the $g_i$) are known as stochastic approximations.
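
As a rough illustration of the stochastic approximation (15.7), the following Python sketch computes a Monte Carlo p-value from $B - 1$ randomly sampled transformations. The sampler `random_sign_change` (uniform draws from the sign-change group of Example 15.2.1) and the function names are our own assumptions for the purpose of the example; any sampler that draws uniformly from the group could be substituted.

```python
import random


def random_sign_change(x, rng):
    """Apply a uniformly drawn sign-change transformation to x (assumed group)."""
    return [xi if rng.random() < 0.5 else -xi for xi in x]


def monte_carlo_p_value(x, statistic, sample_transform, B, rng):
    """Stochastic approximation (15.7) using B - 1 sampled transformations."""
    t_obs = statistic(x)
    exceed = 0
    for _ in range(B - 1):
        if statistic(sample_transform(x, rng)) >= t_obs:
            exceed += 1
    # The leading 1 accounts for the observed data itself.
    return (1 + exceed) / B


rng = random.Random(0)
x = [0.8, 1.9, -0.4, 2.3, 1.1, 0.6]   # hypothetical observations
p_tilde = monte_carlo_p_value(x, sum, random_sign_change, 1000, rng)
```

Because of the added 1 in the numerator, the result is never smaller than $1/B$, which is consistent with the validity property (15.8) holding for every $B$.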


15.2.2 Asymptotic Results 

We next study the limiting behavior of the randomization test in order to derive
its large sample power properties. For example, for testing that the mean of a normal
distribution is zero with unspecified variance, one would use the optimal t-test.
But if we use the randomization test based on the transformations in Example
15.2.1, we will find that the randomization test has the same limiting power
as the t-test against contiguous alternatives, and so is LAUMP. Of course, for
testing the mean, the randomization test can be used without the assumption of
normality, and we will study its asymptotic properties both when the underlying
distribution is symmetric so that the randomization hypothesis holds, and also
when the randomization hypothesis fails.

Consider a sequence of situations with $X = X^n$, $P = P_n$, $\mathcal{X} = \mathcal{X}_n$, $\mathbf{G} = \mathbf{G}_n$,
$T = T_n$, etc. defined for $n = 1, 2, \ldots$; notice we use a superscript for the data
$X = X^n$. Typically, $X = X^n = (X_1, \ldots, X_n)$ consists of $n$ i.i.d. observations
and the goal is to consider the behavior of the randomization test sequence as
$n \to \infty$.

Let $\hat R_n$ denote the randomization distribution of $T_n$ defined by

$$\hat R_n(t) = M^{-1} \sum_{g \in \mathbf{G}_n} I\{T_n(gX^n) \le t\} . \tag{15.9}$$

We seek the limiting behavior of $\hat R_n(\cdot)$ and its $1 - \alpha$ quantile, which we now
denote $\hat r_n(1 - \alpha)$ (but in the previous subsection was denoted $T^{(k)}(X)$); thus,

$$\hat r_n(1 - \alpha) = \hat R_n^{-1}(1 - \alpha) = \inf\{t : \hat R_n(t) \ge 1 - \alpha\} .$$

We will study the behavior of $\hat R_n$ under the null hypothesis and under a sequence
of alternatives. First, observe that

$$E[\hat R_n(t)] = P\{T_n(G_n X^n) \le t\} ,$$

where $G_n$ is a random variable that is uniform on $\mathbf{G}_n$. So, in the case the
randomization hypothesis holds, $G_n X^n$ and $X^n$ have the same distribution and
so

$$E[\hat R_n(t)] = P\{T_n(X^n) \le t\} .$$

Then, if $T_n$ converges in distribution to a c.d.f. $R(\cdot)$ which is continuous at $t$, it
follows that

$$E[\hat R_n(t)] \to R(t) .$$

In order to deduce $\hat R_n(t) \xrightarrow{P} R(t)$ (i.e., the randomization distribution
asymptotically approximates the unconditional distribution of $T_n$), it is then enough
to show $\mathrm{Var}[\hat R_n(t)] \to 0$. This approach for proving consistency of $\hat R_n(t)$ and
$\hat r_n(1 - \alpha)$ is used in the following result, due to Hoeffding (1952). Note that the
randomization hypothesis is not assumed.

Theorem 15.2.3 Suppose $X^n$ has distribution $P_n$ in $\mathcal{X}_n$, and $\mathbf{G}_n$ is a finite
group of transformations from $\mathcal{X}_n$ to $\mathcal{X}_n$. Let $G_n$ be a random variable that is
uniform on $\mathbf{G}_n$. Also, let $G_n'$ have the same distribution as $G_n$, with $X^n$, $G_n$,
and $G_n'$ mutually independent. Suppose, under $P_n$,

$$(T_n(G_n X^n), T_n(G_n' X^n)) \xrightarrow{d} (T, T') , \tag{15.10}$$

where $T$ and $T'$ are independent, each with common c.d.f. $R(\cdot)$. Then, under $P_n$,

$$\hat R_n(t) \xrightarrow{P} R(t) \tag{15.11}$$

for every $t$ which is a continuity point of $R(\cdot)$. Let

$$r(1 - \alpha) = \inf\{t : R(t) \ge 1 - \alpha\} .$$

Suppose $R(\cdot)$ is continuous and strictly increasing at $r(1 - \alpha)$. Then, under $P_n$,

$$\hat r_n(1 - \alpha) \xrightarrow{P} r(1 - \alpha) .$$

Proof. Let $t$ be a continuity point of $R(\cdot)$. Then,

$$E_{P_n}[\hat R_n(t)] = P_n\{T_n(G_n X^n) \le t\} \to R(t) ,$$

by the convergence hypothesis (15.10). It therefore suffices to show that
$\mathrm{Var}_{P_n}[\hat R_n(t)] \to 0$ or, equivalently, that

$$E_{P_n}[\hat R_n^2(t)] \to R^2(t) .$$

But,

$$E_{P_n}[\hat R_n^2(t)] = M^{-2} \sum_{g} \sum_{g'} P_n\{T_n(gX^n) \le t, \ T_n(g'X^n) \le t\}$$

$$= P_n\{T_n(G_n X^n) \le t, \ T_n(G_n' X^n) \le t\} \to R^2(t) ,$$

again by the convergence hypothesis (15.10). Hence, $\hat R_n(t) \to R(t)$ in $P_n$-probability.
The convergence of $\hat r_n(1 - \alpha)$ now follows by Lemma 11.2.1
(ii). ■

Note that, if the randomization hypothesis holds, then $T_n(X^n)$ and $T_n(G_n X^n)$
have the same distribution. The assumption (15.10) then implies the unconditional
distribution of $T_n(X^n)$ under $P_n$ converges to $R$ in distribution.
The conclusion is that the randomization distribution approximates this
(unconditional) limit distribution in the sense that (15.11) holds.


Example 15.2.4 (One Sample Test, continuation of Example 15.2.1) In
Example 15.2.1, first consider $T_n = n^{1/2} \bar X_n$. If $P$ denotes the common distribution
of the $X_i$, then $P_n = P^n$ is the joint distribution of the sample. Let $P$ be any
distribution with mean 0 and finite nonzero variance $\sigma^2(P)$ (not necessarily symmetric).
We will verify (15.10) with $R(t) = \Phi(t/\sigma(P))$. Let $\epsilon_1, \ldots, \epsilon_n, \epsilon_1', \ldots, \epsilon_n'$
be mutually independent random variables, each 1 or $-1$ with probability $\frac{1}{2}$ each.
We must find the limiting distribution of

$$n^{-1/2} \sum_{i} (\epsilon_i X_i, \ \epsilon_i' X_i) .$$

But, the vectors $(\epsilon_i X_i, \epsilon_i' X_i)$, $1 \le i \le n$, are i.i.d. with

$$E_P(\epsilon_i X_i) = E_P(\epsilon_i' X_i) = E(\epsilon_i) E_P(X_i) = 0 ,$$

$$E_P[(\epsilon_i X_i)^2] = E(\epsilon_i^2) E_P(X_i^2) = \sigma^2(P) = E_P[(\epsilon_i' X_i)^2] ,$$

and

$$\mathrm{Cov}_P(\epsilon_i X_i, \ \epsilon_i' X_i) = E_P(\epsilon_i \epsilon_i' X_i^2) = E(\epsilon_i) E(\epsilon_i') E_P(X_i^2) = 0 .$$

By the bivariate Central Limit Theorem,

$$n^{-1/2} \sum_{i} (\epsilon_i X_i, \ \epsilon_i' X_i) \xrightarrow{d} (T, T') ,$$

where $T$ and $T'$ are independent, each distributed as $N(0, \sigma^2(P))$. Hence, by
Theorem 15.2.3, we conclude

$$\hat R_n(t) \xrightarrow{P} \Phi(t/\sigma(P))$$

and

$$\hat r_n(1 - \alpha) \xrightarrow{P} \sigma(P) z_{1-\alpha} .$$

Let $\phi_n$ be the randomization test which rejects when $T_n > \hat r_n(1 - \alpha)$, accepts
when $T_n < \hat r_n(1 - \alpha)$ and possibly randomizes when $T_n = \hat r_n(1 - \alpha)$. Since $T_n$ is
asymptotically normal, it follows by Slutsky's Theorem that

$$E_P(\phi_n) = P\{T_n > \hat r_n(1 - \alpha)\} + o(1) \to P\{\sigma(P) Z > \sigma(P) z_{1-\alpha}\} = \alpha ,$$

where $Z$ denotes a standard normal variable. In other words, we have deduced
the following for the problem of testing the mean of $P$ is zero versus the mean
exceeds zero. By Theorem 15.2.1, $\phi_n$ is exact level $\alpha$ if the underlying distribution
is symmetric about 0; otherwise, it is at least asymptotically pointwise level $\alpha$ as
long as the variance is finite.

We now investigate the asymptotic power of $\phi_n$ against the sequence of
alternatives that the observations are $N(hn^{-1/2}, \sigma^2)$. By the above, under
$N(0, \sigma^2)$, $\hat r_n(1 - \alpha) \to \sigma z_{1-\alpha}$ in probability. By contiguity, it follows that, under
$N(hn^{-1/2}, \sigma^2)$, $\hat r_n(1 - \alpha) \to \sigma z_{1-\alpha}$ in probability as well. Under $N(hn^{-1/2}, \sigma^2)$,
$T_n$ is $N(h, \sigma^2)$. Therefore, by Slutsky's Theorem, the limiting power of $\phi_n$ against
$N(hn^{-1/2}, \sigma^2)$ is then

$$E_{P_n}(\phi_n) \to P\{\sigma Z + h > \sigma z_{1-\alpha}\} = 1 - \Phi\Big(z_{1-\alpha} - \frac{h}{\sigma}\Big) .$$

In fact, this is also the limiting power of the optimal t-test for this problem. Thus,
there is asymptotically no loss in efficiency when using the randomization test
as opposed to the optimal t-test, but the randomization test has the advantage
that its size is $\alpha$ over all symmetric distributions. In the terminology of Section
13.2, the efficacy of the randomization test is $1/\sigma$ and its ARE with respect to
the t-test is 1. In fact, the ARE is 1 whenever the underlying family is a q.m.d.
location family with finite variance (Problem 15.6).
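
To get a feel for the limiting power $1 - \Phi(z_{1-\alpha} - h/\sigma)$ derived above, it can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the function name `limiting_power` and the chosen values of $h$, $\sigma$ and $\alpha$ are ours, introduced for illustration only.

```python
from statistics import NormalDist


def limiting_power(h, sigma, alpha):
    """Limiting power 1 - Phi(z_{1-alpha} - h / sigma) against N(h n^{-1/2}, sigma^2)."""
    std_normal = NormalDist()                  # standard normal distribution
    z = std_normal.inv_cdf(1 - alpha)          # upper quantile z_{1-alpha}
    return 1 - std_normal.cdf(z - h / sigma)


# Hypothetical local alternatives h = 0, 1, 2 with sigma = 1 and alpha = 0.05.
powers = [limiting_power(h, 1.0, 0.05) for h in (0.0, 1.0, 2.0)]
```

At $h = 0$ the expression reduces to $\alpha$, and it increases with $h/\sigma$, which matches the interpretation of $1/\sigma$ as the efficacy.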

In fact, the randomization test that is based on $T_n$ is identical to the randomization
test that is based on the usual t-statistic $t_n$. To see why, first observe
that the randomization test based on $T_n$ is identical to the randomization test
based on $\tilde T_n = T_n / (\sum_i X_i^2)^{1/2}$, simply because all "randomizations" of the data
have the same value for the sum of squares. But, as was seen in Section 5.2,
$t_n$ is an increasing function of $\tilde T_n$ for positive $\tilde T_n$. Hence, the one-sample t-test
which rejects when $t_n$ exceeds $t_{n-1,1-\alpha}$, the $1 - \alpha$ quantile of the t-distribution
with $n - 1$ degrees of freedom, is equivalent to a randomization test based on the
statistic $t_n$, except that $t_{n-1,1-\alpha}$ is replaced by the data-dependent value. Such
an analogy was previously made for the two-sample test in Section 5.8.

The value of the randomization test is that one does not have to assume
normality. On the other hand, the asymptotic results allow one to avoid the
exact computation of the randomization distribution by approximating the critical
value by the normal quantile $z_{1-\alpha}$ or even $t_{n-1,1-\alpha}$. The problem of whether
to use $z_{1-\alpha}$ or $t_{n-1,1-\alpha}$ is solved in Diaconis and Holmes (1994), who also give
algorithms for the exact evaluation of the randomization distribution. ■

In the previous example, it was seen that the randomization distribution
approximates the (unconditional) null distribution of $T_n$ in the sense that

$$\hat R_n(t) - P\{T_n \le t\} \xrightarrow{P} 0$$

if $P$ has mean 0 and finite variance, since $P\{T_n \le t\} \to \Phi(t/\sigma(P))$. The following
is a more general version of this result.

Theorem 15.2.4 (i) Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued random variables
with distribution $P$, assumed symmetric about 0. Assume $T_n$ is asymptotically
linear in the sense that, for some function $\psi_P$,

$$T_n = n^{-1/2} \sum_{i=1}^{n} \psi_P(X_i) + o_P(1) , \tag{15.12}$$

where $E_P[\psi_P(X_i)] = 0$ and $\tau_P^2 = \mathrm{Var}_P[\psi_P(X_i)] < \infty$. Also, assume $\psi_P$ is an
odd function. Let $\hat R_n$ denote the randomization distribution based on $T_n$ and the
group of sign changes in Example 15.2.1. Then, the hypotheses of Theorem 15.2.3
hold with $P_n = P^n$ and $R(t) = \Phi(t/\tau_P)$, and so

$$\hat R_n(t) \xrightarrow{P} \Phi(t/\tau_P) .$$

(ii) If $P$ is not symmetric about 0, let $F$ denote its c.d.f. and define a symmetrized
version $\tilde P$ of $P$ as the probability with c.d.f.

$$\tilde F(t) = \frac{1}{2} [F(t) + 1 - F(-t)] .$$

Assume $T_n$ satisfies (15.12) under $\tilde P$. Then, under $P$,

$$\hat R_n(t) \xrightarrow{P} \Phi(t/\tau_{\tilde P}) \quad \text{and} \quad \hat r_n(1 - \alpha) \xrightarrow{P} \tau_{\tilde P} z_{1-\alpha} .$$

Proof. Independent of $X^n = (X_1, \ldots, X_n)$, let $\epsilon_1, \ldots, \epsilon_n$ and $\epsilon_1', \ldots, \epsilon_n'$ be mutually
independent, each $\pm 1$ with probability $\frac{1}{2}$. Then, in the notation of Theorem
15.2.3, $G_n X^n = (\epsilon_1 X_1, \ldots, \epsilon_n X_n)$. Set $\delta_n(X_1, \ldots, X_n) = T_n - n^{-1/2} \sum_i \psi_P(X_i)$
so that $\delta_n(X_1, \ldots, X_n) \xrightarrow{P} 0$. Since $\epsilon_1 X_1$ has the same distribution as $X_1$, it follows
that $\delta_n(\epsilon_1 X_1, \ldots, \epsilon_n X_n) \xrightarrow{P} 0$, and the same is true with $\epsilon_i$ replaced by $\epsilon_i'$. Then,

$$(T_n(G_n X^n), T_n(G_n' X^n)) = n^{-1/2} \sum_{i=1}^{n} (\psi_P(\epsilon_i X_i), \ \psi_P(\epsilon_i' X_i)) + o_P(1) .$$

But since $\psi_P$ is odd, $\psi_P(\epsilon_i X_i) = \epsilon_i \psi_P(X_i)$. By the bivariate CLT,

$$n^{-1/2} \sum_{i=1}^{n} (\epsilon_i \psi_P(X_i), \ \epsilon_i' \psi_P(X_i)) \xrightarrow{d} (T, T') ,$$

where $(T, T')$ is bivariate normal, each with mean 0 and variance $\tau_P^2$, and

$$\mathrm{Cov}(T, T') = \mathrm{Cov}(\epsilon_i \psi_P(X_i), \ \epsilon_i' \psi_P(X_i)) = E(\epsilon_i) E(\epsilon_i') E_P[\psi_P^2(X_i)] = 0 ,$$

and so (i) follows.

To prove (ii), observe that, if $X$ has distribution $P$ and $\tilde X$ has distribution
$\tilde P$, then $|X|$ and $|\tilde X|$ have the same distribution. But, the construction of the
randomization distribution only depends on the values $|X_1|, \ldots, |X_n|$. Hence, the
behavior of $\hat R_n$ under $P$ and $\tilde P$ must be the same. But, the behavior of $\hat R_n$ under
$\tilde P$ is given in (i). ■


Example 15.2.5 (One-Sample Location Models) Suppose $X_1, \ldots, X_n$ are
i.i.d. $f(x - \theta)$, where $f$ is assumed symmetric about $\theta_0 = 0$. Assume the family is
q.m.d. at $\theta_0$ with score statistic $Z_n$. Thus, under $\theta_0$, $Z_n \xrightarrow{d} N(0, I(\theta_0))$. Consider
the randomization test based on $T_n = Z_n$ (and the group of sign changes). It is
exact level $\alpha$ for all symmetric distributions. Moreover, $Z_n = n^{-1/2} \sum_i \tilde\eta(X_i, \theta_0)$,
where $\tilde\eta$ can always be taken to be an odd function if $f$ is even. So, the assumptions
of Theorem 15.2.4 (i) hold. Hence, when $\theta_0 = 0$,

$$\hat r_n(1 - \alpha) \xrightarrow{P} I^{1/2}(\theta_0) z_{1-\alpha} .$$

By contiguity, the same is true under $\theta_{n,h} = hn^{-1/2}$. By Theorem 13.2.1, the
efficacy of the randomization test is $I^{1/2}(\theta_0)$. By Corollary 13.2.1, the ARE of
the randomization test with respect to the Rao test that uses the critical value
$z_{1-\alpha} I^{1/2}(\theta_0)$ (or even an exact critical value based on the true unconditional
distribution of $Z_n$ under $\theta_0$) is 1. Indeed, the randomization test is AUMP. Therefore,
there is no loss of efficiency in using the randomization test, and it has the
advantage of being level $\alpha$ across symmetric distributions. ■


Example 15.2.6 (Two-Sample Tests, Continuation of Example 15.2.2)
Recall the setup of Example 15.2.2 where $Y_1, \ldots, Y_m$ are i.i.d. $P_Y$ and, independently,
$Z_1, \ldots, Z_n$ are i.i.d. $P_Z$, where $P_Y$ and $P_Z$ are now assumed to be
distributions on the real line. Let $\mu(P)$ and $\sigma^2(P)$ denote the mean and variance,
respectively, of a distribution $P$. Consider the test statistic

$$T_{m,n} = m^{1/2} (\bar Y_m - \bar Z_n) = m^{-1/2} \Big[ \sum_{i=1}^{m} Y_i - \frac{m}{n} \sum_{j=1}^{n} Z_j \Big] . \tag{15.13}$$

Assume $m/n \to \lambda \in (0, \infty)$ as $m, n \to \infty$. If the variances of $P_Y$ and $P_Z$ are finite
and nonzero and $\mu(P_Y) = \mu(P_Z)$, then

$$T_{m,n} \xrightarrow{d} N(0, \ \sigma^2(P_Y) + \lambda \sigma^2(P_Z)) .$$

We wish to study the limiting behavior of the randomization test based on the
test statistic $T_{m,n}$. If the null hypothesis implies that $P_Y = P_Z$, then the
randomization test is exact level $\alpha$, though we may still require an approximation
to its power. On the other hand, we may consider using the randomization test
for testing the null hypothesis $\mu(P_Y) = \mu(P_Z)$, and the randomization test is no
longer exact if the distributions differ.

Let $N = m + n$ and write

$$(X_1, \ldots, X_N) = (Y_1, \ldots, Y_m, Z_1, \ldots, Z_n) .$$

Independent of the $X$s, let $(\pi(1), \ldots, \pi(N))$ and $(\pi'(1), \ldots, \pi'(N))$ be independent
random permutations of $1, \ldots, N$. In order to verify the conditions for
Theorem 15.2.3, we need to determine the joint limiting behavior of

$$(T_{m,n}, T_{m,n}') = m^{-1/2} \Big( \sum_{i=1}^{N} X_i W_i , \ \sum_{i=1}^{N} X_i W_i' \Big) , \tag{15.14}$$

where $W_i = 1$ if $\pi(i) \le m$ and $W_i = -m/n$ otherwise; $W_i'$ is defined with $\pi$
replaced by $\pi'$. Note that $E(W_i) = E(X_i W_i) = 0$. Moreover, an easy calculation
(Problem 15.8) gives

$$\mathrm{Var}(T_{m,n}) = \frac{m}{n} \sigma^2(P_Y) + \sigma^2(P_Z) \tag{15.15}$$

and

$$\mathrm{Cov}(T_{m,n}, T_{m,n}') = m^{-1} \sum_{i=1}^{N} \sum_{j=1}^{N} E(X_i X_j W_i W_j') = 0 , \tag{15.16}$$

by the independence of the $W_i$ and the $W_i'$. These calculations suggest the
following result.

Theorem 15.2.5 Assume the above setup with $m/n \to \lambda \in (0, \infty)$. If $\sigma^2(P_Y)$
and $\sigma^2(P_Z)$ are finite and nonzero and $\mu(P_Y) = \mu(P_Z)$, then (15.14) converges
in law to a bivariate normal distribution with independent, identically distributed
marginals having mean 0 and variance

$$\tau^2 = \lambda \sigma^2(P_Y) + \sigma^2(P_Z) .$$

Proof. Assume without loss of generality that $\mu(P_Y) = 0$. By the Cramér-Wold
device (Theorem 11.2.3), it suffices to show, for any $a$ and $b$,

$$m^{-1/2} \sum_{i=1}^{N} X_i (aW_i + bW_i') \xrightarrow{d} N(0, \ (a^2 + b^2) \tau^2) .$$

The argument follows by conditioning on the $W_i$ and $W_i'$ and writing the left side
as

$$m^{-1/2} \sum_{i=1}^{m} Y_i (aW_i + bW_i') + m^{-1/2} \sum_{j=1}^{n} Z_j (aW_{m+j} + bW_{m+j}') , \tag{15.17}$$

which becomes (conditionally) a sum of two independent terms, each a linear
combination of independent variables. It is not hard to check that
$m^{-1} \sum_i (aW_i + bW_i')^2$ is
bounded in probability (because its expectation is uniformly bounded) and

$$m^{-1} \max_i |aW_i + bW_i'|^2 \xrightarrow{P} 0 . \tag{15.18}$$

Thus, Lemma 11.3.3 can be applied (conditionally) to each term in (15.17) and
the result follows. ■

Consider the problem of testing equality of means in the two-sample problem
without imposing parametric assumptions on the underlying distributions, which
can be viewed as a nonparametric version of the Behrens-Fisher problem. Theorem
15.2.3 and Theorem 15.2.5 imply that the randomization distribution is, in
large samples, approximately a normal distribution with mean 0 and variance $\tau^2$.
Hence, the critical value of the randomization test that rejects for large values
of $T_{m,n}$ converges in probability to $z_{1-\alpha} \tau$. On the other hand, the true sampling
distribution of $T_{m,n}$ is approximately normal with mean 0 and variance

$$\sigma^2(P_Y) + \lambda \sigma^2(P_Z) ,$$

if $\mu(P_Y) = \mu(P_Z)$. These two distributions are identical if and only if $\lambda = 1$ or
$\sigma^2(P_Y) = \sigma^2(P_Z)$. Therefore, for testing equality of means, the randomization
test will be pointwise consistent in level even if $P_Y$ and $P_Z$ differ, as long as the
variances of the populations are the same, or the sample sizes are roughly the
same. In particular, when the underlying distributions have the same variance
(as in the normal theory model assumed in Section 5.3 for which the two-sample
t-test is UMPU), the two-sample t-test is asymptotically equivalent to the
corresponding randomization test. This equivalence is not limited to the behavior
under the null hypothesis; see Problem 15.10.

If the underlying variances differ and $\lambda \ne 1$, the permutation test based on
$T_{m,n}$ given in (15.13) will have rejection probability that does not tend to $\alpha$.
However, if one replaces $T_{m,n}$ by the studentized version

$$\tilde T_{m,n} = T_{m,n} \Big/ \sqrt{S_Y^2 + \frac{m}{n} S_Z^2} , \tag{15.19}$$

where

$$S_Y^2 = (m - 1)^{-1} \sum_{i=1}^{m} (Y_i - \bar Y_m)^2 \quad \text{and} \quad S_Z^2 = (n - 1)^{-1} \sum_{j=1}^{n} (Z_j - \bar Z_n)^2 ,$$

then the permutation test is pointwise consistent in level for testing equality of
means, even when the underlying distributions have possibly different variances
and the sample sizes differ (Problem 15.11).

Further results are given in Romano (1990). For example, two-sample permutation
tests based on sample medians lead to tests that are not even pointwise
consistent in level, unless the strict randomization hypothesis of equality of
distributions holds. Thus, if testing equality of population medians based on the
difference between sample medians, the asymptotic rejection probability of the
randomization test need not be $\alpha$ even when the underlying populations have the
same median.
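
The two-sample permutation tests discussed in Example 15.2.6 can be sketched as follows, for both the statistic (15.13) and its studentized version (15.19). This is an illustrative sketch only: it uses randomly sampled permutations (the stochastic approximation of Section 15.2.1) rather than the full group, and the function names and the data in the usage lines are our own assumptions.

```python
import math
import random
from statistics import mean, variance


def unstudentized(y, z):
    """T_{m,n} = m^{1/2} (Ybar_m - Zbar_n), as in (15.13)."""
    return math.sqrt(len(y)) * (mean(y) - mean(z))


def studentized(y, z):
    """Studentized statistic (15.19); variance() uses the divisors m - 1 and n - 1."""
    m, n = len(y), len(z)
    return unstudentized(y, z) / math.sqrt(variance(y) + (m / n) * variance(z))


def permutation_p_value(y, z, statistic, B, rng):
    """Monte Carlo permutation p-value based on B - 1 random permutations."""
    m = len(y)
    pooled = list(y) + list(z)
    t_obs = statistic(y, z)
    exceed = 0
    for _ in range(B - 1):
        rng.shuffle(pooled)                       # uniformly random permutation
        if statistic(pooled[:m], pooled[m:]) >= t_obs:
            exceed += 1
    return (1 + exceed) / B


# Hypothetical samples with clearly separated means.
rng = random.Random(0)
y = [10.1, 11.3, 9.8, 10.7, 12.0, 10.4]
z = [0.2, -0.5, 1.1, 0.3, -1.2, 0.8, 0.0, 0.6]
p_plain = permutation_p_value(y, z, unstudentized, 200, rng)
p_student = permutation_p_value(y, z, studentized, 200, rng)
```

Only the statistic changes between the two calls; the permutation mechanism is identical, which is why studentizing is a cheap way to obtain a test that remains asymptotically valid when the variances differ.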


15.3 Basic Large Sample Approximations 

In the previous section, it was shown how permutation and randomization tests
can be used in certain problems where the randomization hypothesis holds.
Unfortunately, randomization tests only apply to a restricted class of problems. In
this section, we discuss some generally used asymptotic approaches for constructing
confidence regions or hypothesis tests based on data $X = X^n$. In what follows,
$X^n = (X_1, \ldots, X_n)$ is typically a sample of $n$ i.i.d. random variables taking values
in a sample space $S$ and having unknown probability distribution $P$, where $P$ is
assumed to belong to a certain collection $\mathbf{P}$ of distributions. Even outside the i.i.d.
case, we think of the data $X^n$ as coming from a model indexed by the unknown
probability mechanism $P$. The collection $\mathbf{P}$ may be a parametric model indexed
by a Euclidean parameter, but we will also consider nonparametric models.

We shall be interested in inferences concerning some parameter $\theta(P)$. By the
usual duality between the construction of confidence regions and hypothesis tests,
we can restrict the discussion to the construction of confidence regions. Let the
range of $\theta$ be denoted $\Theta$, so that

$$\Theta = \{\theta(P) : P \in \mathbf{P}\} .$$

Typically, $\Theta$ is a subset of the real line, but we also consider more general parameters.
For example, the problem of estimating the entire cumulative distribution
function (c.d.f.) of real-valued observations may be treated, so that $\Theta$ is an
appropriate function space.

This leads to considering a root $R_n(X^n, \theta(P))$, a term first coined by Beran
(1984), which is just some real-valued functional depending on both $X^n$
and $\theta(P)$. The idea is that a confidence interval for $\theta(P)$ could be constructed if
the distribution of the root were known. For example, an estimator $\hat\theta_n$ of a real-valued
parameter $\theta(P)$ might be given so that a natural choice is $R_n(X^n, \theta(P)) =
[\hat\theta_n - \theta(P)]$, or alternatively $R_n(X^n, \theta(P)) = [\hat\theta_n - \theta(P)]/s_n$, where $s_n$ is some
estimate of the standard deviation of $\hat\theta_n$.

When $\mathbf{P}$ is suitably large so that the problem is nonparametric in nature, a
natural construction for an estimator $\hat\theta_n$ of $\theta(P)$ is the plug-in estimator $\hat\theta_n =
\theta(\hat P_n)$, where $\hat P_n$ is the empirical distribution of the data, defined by

$$\hat P_n(E) = n^{-1} \sum_{i=1}^{n} I\{X_i \in E\} .$$

Of course, this construction implicitly assumes that $\theta(\cdot)$ is defined for empirical
distributions so that $\theta(\hat P_n)$ is at least well-defined. Alternatively, in parametric
problems for which $\mathbf{P}$ is indexed by a parameter $\psi$ belonging to a subset $\Psi$ of
$\mathbb{R}^p$ so that $\mathbf{P} = \{P_\psi : \psi \in \Psi\}$, then $\theta(P)$ can be described as a functional $t(\psi)$.
Hence, $\hat\theta_n$ is often taken to be $t(\hat\psi_n)$, where $\hat\psi_n$ is some desirable estimator of $\psi$,
such as an efficient likelihood estimator.

Let $J_n(P)$ be the distribution of $R_n(X^n, \theta(P))$ under $P$, and let $J_n(\cdot, P)$ be
the corresponding cumulative distribution function defined by

$$J_n(x, P) = P\{R_n(X^n, \theta(P)) \le x\} .$$

In order to construct a confidence region for $\theta(P)$ based on the root
$R_n(X^n, \theta(P))$, the sampling distribution $J_n(P)$ or its appropriate quantiles must
be known or estimated. Some standard methods, based on pivots and asymptotic
approximations, are now briefly reviewed. Note that in many of the examples
when the observations are real-valued, it is more convenient and customary to
index the unknown family of distributions by the cumulative distribution function
$F$ rather than $P$. We will freely use both, depending on the situation.


15.3.1 Pivotal Method

In certain exceptional cases, the distribution $J_n(P)$ of $R_n(X^n, \theta(P))$ under $P$ does
not depend on $P$. In this case, the root $R_n(X^n, \theta(P))$ is called a pivotal quantity
or a pivot for short. Such quantities were previously considered in Section 6.12.
From a pivot, a level $1 - \alpha$ confidence region for $\theta(P)$ can be constructed by
choosing constants $c_1$ and $c_2$ so that

$$P\{c_1 \le R_n(X^n, \theta(P)) \le c_2\} \ge 1 - \alpha . \tag{15.20}$$

Then, the confidence region

$$C_n = \{\theta \in \Theta : c_1 \le R_n(X^n, \theta) \le c_2\}$$

contains $\theta(P)$ with probability under $P$ at least $1 - \alpha$. Of course, the coverage
probability is exactly $1 - \alpha$ if one has equality in (15.20).

Classical examples where confidence regions may be formed from a pivot are 
the following. 


Example 15.3.1 (Location and Scale Families) Suppose we are given an
i.i.d. sample $X^n = (X_1, \ldots, X_n)$ of $n$ real-valued random variables, each having
a distribution function of the form $F[(x - \theta)/\sigma]$, where $F$ is known, $\theta$ is a
location parameter, and $\sigma$ is a scale parameter. More generally, suppose $\hat\theta_n$ is
location and scale equivariant in the sense that

$$\hat\theta_n(aX_1 + b, \ldots, aX_n + b) = a\,\hat\theta_n(X_1, \ldots, X_n) + b ;$$

also suppose $\hat\sigma_n$ is location invariant and scale equivariant in the sense that

$$\hat\sigma_n(aX_1 + b, \ldots, aX_n + b) = |a|\,\hat\sigma_n(X_1, \ldots, X_n) .$$

Then, the root $R_n(X^n, \theta(P)) = n^{1/2}[\hat\theta_n - \theta(P)]/\hat\sigma_n$ is a pivot (Problem 15.14).
For example, in the case where $F$ is the standard normal distribution function, $\hat\theta_n$
is the sample mean and $\hat\sigma_n^2$ is the usual unbiased estimate of variance, $R_n$ has a
$t$-distribution with $n - 1$ degrees of freedom. For another example, if $\hat\sigma_n$ is location
invariant and scale equivariant, then $\hat\sigma_n/\sigma$ is also a pivot, since its distribution
will not depend on $\theta$ or $\sigma$, but will of course depend on $F$. When $F$ is not
normal, exact distribution theory may be difficult, but one may resort to Monte
Carlo simulation of $J_n(P)$ (discussed below). This example can be generalized
to a class of parametric problems where group invariance considerations apply,
and pivotal quantities lead to equivariant confidence sets; see Section 6.12 and
Problems 6.69-6.72. ■


Example 15.3.2 (Kolmogorov-Smirnov Confidence Bands) Suppose that
$X^n = (X_1, \ldots, X_n)$ is a sample of $n$ real-valued random variables having a
distribution function $F$. For a fixed value of $x$, a (pointwise) confidence interval for
$F(x)$ can be based on the empirical distribution function $\hat F_n(x)$, by using the fact
that $n\hat F_n(x)$ has a binomial distribution with parameters $n$ and $F(x)$. The goal
now is to construct a uniform or simultaneous confidence band for $\theta(F) = F$, so
that it is required to find a set of distribution functions containing the true $F(x)$
for all $x$ (or uniformly in $x$) with coverage probability $1 - \alpha$. Toward this end,
consider the root

$$R_n(X^n, F) = n^{1/2} \sup_x |\hat F_n(x) - F(x)| .$$

Recall that, if $F$ is continuous, then the distribution of $R_n(X^n, F)$ under $F$
does not depend on $F$ and so $R_n(X^n, F)$ is a pivot (Section 6.13 and Problem
11.57). As discussed in Sections 6.13 and 14.2, the finite sample quantiles of this
distribution have been tabled. Without the assumption that $F$ is continuous, the
distribution of $R_n(X^n, F)$ under $F$ does depend on $F$, both in finite samples and
asymptotically. ■

In general, if $R_n(X^n, \theta(P))$ is a pivot, its distribution may not be explicitly
computable or have a known tractable form. However, since there is only one
distribution that needs to be known (and not an entire family indexed by $P$), the
problem is much simpler than if the distribution depends on $P$. One can resort to
Monte Carlo simulation to approximate this distribution to any desired level of
accuracy, by simulating the distribution of $R_n(X^n, \theta(P))$ under $P$ for any choice
of $P$ in $\mathbf{P}$. For further details, see Example 11.2.13.
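
As an illustration of this Monte Carlo step (a minimal sketch, not part of the original
development), the following Python code approximates a quantile of the Kolmogorov-Smirnov
root by simulating it under the uniform distribution on (0, 1), which is one convenient
choice of continuous $F$. The function names `ks_root`, `ks_quantile` and `ks_band`, the
number of simulations, and the seed are all assumptions made for the example; the band
is reported only at the order statistics and assumes no ties.

```python
import math
import random


def ks_root(sample, cdf):
    """Root n^{1/2} * sup_x |F_n(x) - F(x)| for a continuous c.d.f. F."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # The supremum is approached at, or just before, an order statistic.
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return math.sqrt(n) * d


def ks_quantile(n, level, n_sim=2000, seed=0):
    """Monte Carlo estimate of the `level` quantile of the pivot, simulated under U(0, 1)."""
    rng = random.Random(seed)
    roots = sorted(
        ks_root([rng.random() for _ in range(n)], lambda u: u) for _ in range(n_sim)
    )
    # Smallest simulated value whose empirical c.d.f. is at least `level`.
    return roots[max(0, math.ceil(level * n_sim) - 1)]


def ks_band(sample, level=0.95, n_sim=2000, seed=0):
    """Return (x, lower, upper) triples: the band evaluated at each order statistic."""
    xs = sorted(sample)
    n = len(xs)
    half_width = ks_quantile(n, level, n_sim, seed) / math.sqrt(n)
    return [
        (x, max(0.0, i / n - half_width), min(1.0, i / n + half_width))
        for i, x in enumerate(xs, start=1)
    ]
```

Because the root is a pivot for continuous $F$, the simulated quantile depends only on $n$
and the level, so it can be computed once and reused for any data set of the same size.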


15.3.2 Asymptotic Pivotal Method 

In general, the above construction breaks down because $R_n(X^n, \theta(P))$ has a
distribution $J_n(P)$ which depends on the unknown probability distribution $P$
generating the data. However, it is then sometimes the case that $J_n(P)$ converges
weakly to a limiting distribution $J$ which is independent of $P$. In this
case, the root (sequence) $R_n(X^n, \theta(P))$ is called an asymptotic pivot, and then
the quantiles of $J$ may be used to construct an asymptotic confidence region for
$\theta(P)$.


Example 15.3.3 (Parametric Models) Suppose $X^n = (X_1, \ldots, X_n)$ is a
sample from a model $\{P_\theta, \theta \in \Omega\}$, where $\Omega$ is a subset of $\mathbb{R}^k$. To construct a
confidence region for $\theta$, suppose $\hat\theta_n$ is an efficient likelihood estimator (as discussed
in Section 12.4), satisfying

$$n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, I^{-1}(\theta)) ,$$

where $I(\theta)$ is the Fisher Information matrix, assumed continuous. Then, the root
(expressed as a function of $\theta$ rather than $P_\theta$)

$$R_n(X^n, \theta) = n(\hat\theta_n - \theta)^T I(\hat\theta_n)(\hat\theta_n - \theta)$$

is an asymptotic pivot. The limiting distribution is $\chi^2_k$, the Chi-squared
distribution with $k$ degrees of freedom, and the resulting confidence region is Wald's
confidence ellipsoid introduced in Section 12.4.2. Alternatively, let

$$R_n(X^n, \theta) = \frac{\sup_{\beta \in \Omega} L_n(\beta)}{L_n(\theta)} ,$$

where $L_n(\theta)$ is the likelihood function (12.56). As discussed in Section 12.4.2,
under regularity conditions, $2 \log R_n(X^n, \theta)$ is asymptotically $\chi^2_k$, in which case
$R_n(X^n, \theta)$ is an asymptotic pivot. ■


Example 15.3.4 (Nonparametric Mean) Suppose $X^n = (X_1, \ldots, X_n)$ is a
sample of $n$ real-valued random variables having distribution function $F$, and
we wish to construct a confidence interval for $\theta(F) = E_F(X_i)$, the mean of the
observations. Assume $X_i$ has a finite nonzero variance $\sigma^2(F)$. Let the root $R_n$ be
the usual $t$-statistic defined by $R_n(X^n, \theta(F)) = n^{1/2}[\bar X_n - \theta(F)]/S_n$, where $\bar X_n$ is
the sample mean and $S_n^2$ is the (unbiased version of the) sample variance. Then,
$J_n(F)$ converges weakly to $J = N(0, 1)$, and so the $t$-statistic is an asymptotic
pivot. ■


15.3.3 Asymptotic Approximation

The pivotal method assumes the root has a distribution $J_n(P)$ which does not
depend on $P$, while the asymptotic pivotal method assumes the root has an
asymptotic distribution $J(P)$ which does not depend on $P$. More generally, $J_n(P)$
converges to a limiting distribution $J(P)$ which depends on $P$, and we shall now
consider this case. Suppose that this limiting distribution has a known form which
depends on $P$, but only through some unknown parameters. For example, in the
nonparametric mean example, the root $n^{1/2}[\bar X_n - \theta(F)]$ has the limiting $N(0, \sigma^2(F))$
distribution, which depends on $F$ through the variance parameter $\sigma^2(F)$. An
approximation of the asymptotic distribution is $J(\hat P_n)$, where $\hat P_n$ is some
estimate of $P$. Typically, $J(P)$ is a normal distribution with mean zero and variance
$\tau^2(P)$. The approximation then consists of a normal approximation based on
an estimated variance $\tau^2(\hat P_n)$ which converges in probability to $\tau^2(P)$, and the
quantiles of $J_n(P)$ may then be approximated by those of $J(\hat P_n)$. Of course, this
approach depends very heavily on knowing the form of the asymptotic distribution
as well as being able to construct consistent estimates of the unknown
parameters upon which $J(P)$ depends. Moreover, the method essentially consists
of a double approximation; first, the finite sampling distribution $J_n(P)$ is
approximated by an asymptotic approximation $J(P)$, and then $J(P)$ is in turn
approximated by $J(\hat P_n)$.

The most general situation occurs when the limiting distribution J(P) has an 
unknown form, and methods to handle this case will be treated in the subsequent 
sections. 


Example 15.3.5 (Nonparametric Mean, continued) In the previous
example, consider instead the non-studentized root

$$R_n(X^n, \theta(F)) = n^{1/2}[\bar X_n - \theta(F)] .$$

In this case, $J_n(F)$ converges weakly to $J(F)$, the normal distribution with mean
zero and variance $\sigma^2(F)$. The resulting approximation to $J_n(F)$ is the normal
distribution with mean zero and variance $S_n^2$. Alternatively, one can estimate the
variance by any consistent estimator, such as the sample variance $\sigma^2(\hat F_n)$, where
$\hat F_n$ is the empirical distribution function. In effect, studentizing an asymptotically
normal root converts it to an asymptotic pivot, and both methods lead to the
same solution. (However, the bootstrap approach in the next section treats the
roots differently.) ■
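
The agreement of the two approaches can be checked numerically. The sketch below is only
illustrative (the function names and the choice $\alpha = 0.05$ are assumptions, and only
the Python standard library is used): one function inverts the non-studentized root using
the quantiles of $N(0, S_n^2)$, the other inverts the $t$-statistic root using the
quantiles of $N(0, 1)$.

```python
import math
import statistics


def interval_nonstudentized(sample, alpha=0.05):
    """Invert n^{1/2}(mean - theta), with J_n(F) approximated by N(0, S_n^2)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s_n = statistics.stdev(sample)  # square root of the unbiased sample variance
    approx = statistics.NormalDist(mu=0.0, sigma=s_n)
    lo_q, hi_q = approx.inv_cdf(alpha / 2), approx.inv_cdf(1 - alpha / 2)
    return mean - hi_q / math.sqrt(n), mean - lo_q / math.sqrt(n)


def interval_studentized(sample, alpha=0.05):
    """Invert n^{1/2}(mean - theta) / S_n using the N(0, 1) limit."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s_n = statistics.stdev(sample)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * s_n / math.sqrt(n)
    return mean - half_width, mean + half_width
```

Up to floating-point rounding, the two functions return the same interval for any sample
with nonzero sample variance.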

Example 15.3.6 (Binomial p) As in Example 11.2.7, suppose $X$ is binomial
based on $n$ trials and success probability $p$. Let $\hat p_n = X/n$. As in the previous
example, the non-studentized root $n^{1/2}(\hat p_n - p)$ and the studentized root
$n^{1/2}(\hat p_n - p)/[\hat p_n(1 - \hat p_n)]^{1/2}$ lead to the same approximate confidence interval
given by (11.23). On the other hand, the Wilson interval (11.25) based on the root
$n^{1/2}(\hat p_n - p)/[p(1 - p)]^{1/2}$ leads to a genuinely different solution which performs
better in finite samples; see Brown, Cai and DasGupta (2001). ■
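
To make the difference between the two roots concrete, here is a small Python sketch
(function names are illustrative assumptions, and the closed forms are the standard ones
obtained by inverting each root with a normal quantile). The first function inverts the
studentized root; the second inverts the root standardized by $[p(1 - p)]^{1/2}$, which
requires solving a quadratic inequality in $p$.

```python
import math
import statistics


def wald_interval(x, n, alpha=0.05):
    """Invert the studentized root n^{1/2}(p_hat - p) / [p_hat(1 - p_hat)]^{1/2}."""
    p_hat = x / n
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def wilson_interval(x, n, alpha=0.05):
    """Invert the root n^{1/2}(p_hat - p) / [p(1 - p)]^{1/2} by solving a quadratic in p."""
    p_hat = x / n
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return centre - half_width, centre + half_width
```

Unlike the first interval, the second is not centred at $\hat p_n$; its centre is pulled
toward $1/2$, and its endpoints always lie in $[0, 1]$.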

Example 15.3.7 (Trimmed mean) Suppose $X^n = (X_1, \ldots, X_n)$ is a sample
of $n$ real-valued random variables with unknown distribution function $F$. Assume
that $F$ is symmetric about some unknown value $\theta(F)$. Let $\hat\theta_{n,\alpha}(X_1, \ldots, X_n)$ be
the $\alpha$-trimmed mean; specifically,

$$\hat\theta_{n,\alpha} = \frac{1}{n - 2[\alpha n]} \sum_{i=[\alpha n]+1}^{n-[\alpha n]} X_{(i)} ,$$

where $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denote the order statistics and $k = [\alpha n]$ is
the greatest integer less than or equal to $\alpha n$. Consider the root
$R_n(X^n, \theta(F)) = n^{1/2}[\hat\theta_{n,\alpha} - \theta(F)]$. Then, under reasonable smoothness conditions on $F$ and
assuming $0 < \alpha < 1/2$, it is known that $J_n(F)$ converges weakly to the normal
distribution $J(F)$ with mean zero and variance $\sigma^2(\alpha, F)$, where

$$\sigma^2(\alpha, F) = \frac{1}{(1 - 2\alpha)^2} \left[ \int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} (t - \theta(F))^2 \, dF(t) + 2\alpha (F^{-1}(\alpha) - \theta(F))^2 \right] ; \tag{15.21}$$

see Serfling (1980, p.236). Then, a very simple first-order approximation to $J(F)$
is $J(\hat F_n)$, where $\hat F_n$ is the empirical distribution. The resulting $J(\hat F_n)$ is just the
normal distribution with mean zero and variance $\sigma^2(\alpha, \hat F_n)$. ■
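
One way the plug-in variance $\sigma^2(\alpha, \hat F_n)$ might be computed is sketched below in
Python. Several choices here are assumptions made for illustration and are not taken from
the text: the trimmed mean itself is used as the centre $\theta(\hat F_n)$, the order
statistics $X_{(k+1)}$ and $X_{(n-k)}$ stand in for $\hat F_n^{-1}(\alpha)$ and
$\hat F_n^{-1}(1 - \alpha)$, and the tail term $2\alpha(F^{-1}(\alpha) - \theta(F))^2$ is replaced
by the average of its lower and upper versions, which coincide under symmetry.

```python
import math
import statistics


def trimmed_mean(sample, alpha):
    """alpha-trimmed mean: average of X_(k+1), ..., X_(n-k) with k = floor(alpha * n)."""
    xs = sorted(sample)
    n = len(xs)
    k = math.floor(alpha * n)
    return sum(xs[k : n - k]) / (n - 2 * k)


def trimmed_variance_plugin(sample, alpha):
    """Plug-in estimate of sigma^2(alpha, F) in (15.21), under the stated assumptions."""
    xs = sorted(sample)
    n = len(xs)
    k = math.floor(alpha * n)
    centre = trimmed_mean(sample, alpha)
    kept = xs[k : n - k]
    lower_q, upper_q = kept[0], kept[-1]  # stand-ins for the alpha and 1 - alpha quantiles
    middle = sum((x - centre) ** 2 for x in kept) / n  # integral with respect to F_n
    # Symmetrized tail term (an assumption): alpha * (lower^2 + upper^2) replaces 2 * alpha * lower^2.
    tails = alpha * ((lower_q - centre) ** 2 + (upper_q - centre) ** 2)
    return (middle + tails) / (1 - 2 * alpha) ** 2


def trimmed_mean_interval(sample, alpha, level=0.95):
    """Normal-approximation interval based on the root n^{1/2}(trimmed mean - theta)."""
    n = len(sample)
    z = statistics.NormalDist().inv_cdf(1 - (1 - level) / 2)
    centre = trimmed_mean(sample, alpha)
    half_width = z * math.sqrt(trimmed_variance_plugin(sample, alpha) / n)
    return centre - half_width, centre + half_width
```

With `alpha = 0` the functions reduce to the sample mean and the variance of the empirical
distribution, which gives a quick sanity check even though the limit theory above assumes
$0 < \alpha < 1/2$.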

The use of the normal approximation in the previous example hinged on the
availability of a consistent estimate of the asymptotic variance. The simple
expression (15.21) easily led to a simple estimator. However, a closed form expression for
the asymptotic variance may not exist. A fairly general approach to estimating
the variance of a statistic is provided by the jackknife estimator of variance, for
which we refer the reader to Shao and Tu (1995, Chapter 2). However, the double
approximation based on asymptotic normality and an estimate of the limiting
variance may be poor. An alternative approach that more directly attempts to
approximate the finite sample distribution will be presented in the next section.


15.4 Bootstrap Sampling Distributions 

15.4.1 Introduction and Consistency

In this section, the bootstrap, due to Efron (1979), is introduced as a general
method to approximate a sampling distribution of a statistic or a root (discussed
in Section 15.3) in order to construct confidence regions for a parameter
of interest. The use of the bootstrap to approximate a null distribution in the
construction of hypothesis tests will be considered later as well.

The asymptotic approaches in the previous section are not always applicable, as
when the limiting distribution does not have a tractable form. Even when a root
has a known limiting distribution, the resulting approximation may be poor in
finite samples. The bootstrap procedure discussed in this section is an alternative,
more general, direct approach to approximate the sampling distribution $J_n(P)$.
An important aspect of the problem of estimating $J_n(P)$ is that, unlike the usual
problem of estimation of parameters, $J_n(P)$ depends on $n$.

The bootstrap method consists of directly estimating the exact finite sampling
distribution $J_n(P)$ by $J_n(\hat P_n)$, where $\hat P_n$ is an estimate of $P$ in $\mathbf{P}$. In this light,
the bootstrap estimate $J_n(\hat P_n)$ is a simple plug-in estimate of $J_n(P)$.

In nonparametric problems, $\hat P_n$ is typically taken to be the empirical distribution
of the data. In parametric problems where $\mathbf{P} = \{P_\psi : \psi \in \Psi\}$, $\hat P_n$ may be
taken to be $P_{\hat\psi_n}$, where $\hat\psi_n$ is an estimate of $\psi$.

In general, $J_n(x, \hat P_n)$ need not be continuous and strictly increasing in $x$, so
that unique and well-defined quantiles may not exist. To get around this and in
analogy to (11.19), define

$$J_n^{-1}(1 - \alpha, P) = \inf\{x : J_n(x, P) \ge 1 - \alpha\} .$$

If $J_n(\cdot, P)$ has a unique quantile $J_n^{-1}(1 - \alpha, P)$, then

$$P\{R_n(X^n, \theta(P)) \le J_n^{-1}(1 - \alpha, P)\} = 1 - \alpha ;$$

in general, the probability on the left is at least $1 - \alpha$. If $J_n^{-1}(1 - \alpha, P)$ were
known, then the region

$$\{\theta \in \Theta : R_n(X^n, \theta) \le J_n^{-1}(1 - \alpha, P)\}$$

would be a level $1 - \alpha$ confidence region for $\theta(P)$. The bootstrap simply replaces
$J_n^{-1}(1 - \alpha, P)$ by $J_n^{-1}(1 - \alpha, \hat P_n)$. The resulting bootstrap confidence region for
$\theta(P)$ of nominal level $1 - \alpha$ takes the form

$$B_n(1 - \alpha, X^n) = \{\theta \in \Theta : R_n(X^n, \theta) \le J_n^{-1}(1 - \alpha, \hat P_n)\} . \tag{15.22}$$

Suppose the problem is to construct a confidence interval for a real-valued
parameter $\theta(P)$ based on the root $|\hat\theta_n - \theta(P)|$ for some estimator $\hat\theta_n$. The interval
(15.22) would then be symmetric about $\hat\theta_n$. An alternative equi-tailed interval
can be based on the root $\hat\theta_n - \theta(P)$ and uses both tails of $J_n(\hat P_n)$; it is given by

$$\{\theta \in \Theta : J_n^{-1}(\tfrac{\alpha}{2}, \hat P_n) \le R_n(X^n, \theta) \le J_n^{-1}(1 - \tfrac{\alpha}{2}, \hat P_n)\} .$$

A comparison of the two approaches will be made in Section 15.5.

Outside certain exceptional cases, the bootstrap approximation $J_n(x, \hat P_n)$ cannot
be calculated exactly. Even in the relatively simple case when $\theta(P)$ is the
mean of $P$, the root is $n^{1/2}[\bar X_n - \theta(P)]$, and $\hat P_n$ is the empirical distribution, the
exact computation of the bootstrap distribution involves an $n$-fold convolution.$^1$
Typically, one resorts to a Monte Carlo approximation to $J_n(\hat P_n)$, as introduced
in Example 11.2.13. Specifically, conditional on the data $X^n$, for $j = 1, \ldots, B$,
let $X_j^{n*} = (X_{1,j}^*, \ldots, X_{n,j}^*)$ be a sample of $n$ i.i.d. observations from $\hat P_n$; $X_j^{n*}$
is referred to as the $j$th bootstrap sample of size $n$. Of course, when $\hat P_n$ is the
empirical distribution, this amounts to resampling the original observations with
replacement. The bootstrap estimator $J_n(\hat P_n)$ is then approximated by the
empirical distribution of the $B$ values $R_n(X_j^{n*}, \hat\theta_n)$. Because $B$ can be taken to be
large (assuming enough computing power), the resulting approximation can be
made arbitrarily close to $J_n(\hat P_n)$ (see Example 11.2.13), and so we will
subsequently focus on the exact bootstrap estimator $J_n(\hat P_n)$ while keeping in mind it
is usually only approximated by Monte Carlo simulation.
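
The resampling scheme just described can be sketched in a few lines of Python for the
mean, with $\hat P_n$ the empirical distribution. This is a minimal illustration under
stated assumptions rather than a definitive implementation: the function names, the
default $B = 2000$, and the seed are hypothetical choices. The first interval inverts the
root $n^{1/2}[\bar X_n - \theta]$ using both tails; the second uses the root
$n^{1/2}|\bar X_n - \theta|$ and is symmetric about $\bar X_n$.

```python
import math
import random
import statistics


def bootstrap_roots(sample, n_boot=2000, seed=0):
    """Sorted values of n^{1/2}(mean* - mean) over B resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    centre = statistics.fmean(sample)
    roots = []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=n)  # n i.i.d. draws from the empirical distribution
        roots.append(math.sqrt(n) * (statistics.fmean(resample) - centre))
    return sorted(roots)


def quantile(sorted_values, level):
    """inf{x : empirical c.d.f.(x) >= level} for an already sorted list."""
    return sorted_values[max(0, math.ceil(level * len(sorted_values)) - 1)]


def equal_tailed_interval(sample, alpha=0.05, n_boot=2000, seed=0):
    """{theta : q(alpha/2) <= n^{1/2}(mean - theta) <= q(1 - alpha/2)}."""
    n = len(sample)
    roots = bootstrap_roots(sample, n_boot, seed)
    centre = statistics.fmean(sample)
    lo_q, hi_q = quantile(roots, alpha / 2), quantile(roots, 1 - alpha / 2)
    return centre - hi_q / math.sqrt(n), centre - lo_q / math.sqrt(n)


def symmetric_interval(sample, alpha=0.05, n_boot=2000, seed=0):
    """{theta : n^{1/2}|mean - theta| <= q(1 - alpha)}, based on the absolute-value root."""
    n = len(sample)
    abs_roots = sorted(abs(r) for r in bootstrap_roots(sample, n_boot, seed))
    half_width = quantile(abs_roots, 1 - alpha) / math.sqrt(n)
    centre = statistics.fmean(sample)
    return centre - half_width, centre + half_width
```

Note that the upper bootstrap quantile determines the lower endpoint of the equi-tailed
interval and vice versa, because the interval is obtained by solving the inequalities for
$\theta$.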

The bootstrap can then be viewed as a simple plug-in estimator of a distribution
function. This simple idea, combined with Monte Carlo simulation, allows
for quite a broad range of applications.


1 Diaconis and Holmes (1994) show how the exact bootstrap distribution can be 
calculated in some examples. 





We will now discuss the consistency of the bootstrap estimator $J_n(\hat P_n)$ of the
true sampling distribution $J_n(P)$ of $R_n(X^n, \theta(P))$. Typically, one can show that
$J_n(P)$ converges weakly to a nondegenerate limit law $J(P)$. Since the bootstrap
replaces $P$ by $\hat P_n$ in $J_n(\cdot)$, it is useful to study $J_n(P_n)$ under more general
sequences $\{P_n\}$. In order to understand the behavior of the random sequence of
distributions $J_n(\hat P_n)$, it will be easier to first understand how $J_n(P_n)$ behaves
for certain fixed sequences $\{P_n\}$. For the bootstrap to be consistent, $J_n(P)$ must
be smooth in $P$ since we are replacing $P$ by $\hat P_n$. Thus, we are led to studying
the asymptotic behavior of $J_n(P_n)$ under fixed sequences of probabilities $\{P_n\}$
which are “converging” to $P$ in a certain sense. Once it is understood how $J_n(P_n)$
behaves for fixed sequences $\{P_n\}$, it is easy to pass to random sequences $\{\hat P_n\}$.

In the theorem below, the existence of a continuous limiting distribution is 
assumed, though its exact form need not be explicit. Although the conditions of 
the theorem are strong, they can be verified in many interesting examples. 

Theorem 15.4.1 Let $\mathbf{C}_P$ be a set of sequences $\{P_n \in \mathbf{P}\}$ containing the sequence
$\{P, P, \cdots\}$. Suppose that, for every sequence $\{P_n\}$ in $\mathbf{C}_P$, $J_n(P_n)$ converges weakly
to a common continuous limit law $J(P)$ having distribution function $J(x, P)$. Let
$X^n$ be a sample of size $n$ from $P$. Assume that $\hat P_n$ is an estimate of $P$ based on
$X^n$ such that $\{\hat P_n\}$ falls in $\mathbf{C}_P$ with probability one. Then,

$$\sup_x |J_n(x, P) - J_n(x, \hat P_n)| \to 0 \quad \text{with probability one.} \tag{15.23}$$

If $J(\cdot, P)$ is continuous and strictly increasing at $J^{-1}(1 - \alpha, P)$, then

$$J_n^{-1}(1 - \alpha, \hat P_n) \to J^{-1}(1 - \alpha, P) \quad \text{with probability one.} \tag{15.24}$$

Also, the bootstrap confidence set $B_n(1 - \alpha, X^n)$ given by equation (15.22) is
pointwise consistent in level; that is,

$$P\{\theta(P) \in B_n(1 - \alpha, X^n)\} \to 1 - \alpha . \tag{15.25}$$


Proof. For the proof of part (15.23), note that the assumptions and Polya's
Theorem (Theorem 11.2.9) imply that

$$\sup_x |J_n(x, P) - J_n(x, P_n)| \to 0$$

for any sequence $\{P_n\}$ in $\mathbf{C}_P$. Thus, since $\{\hat P_n\} \in \mathbf{C}_P$ with probability one,
(15.23) follows. Lemma 11.2.1 implies $J_n^{-1}(1 - \alpha, P_n) \to J^{-1}(1 - \alpha, P)$ whenever
$\{P_n\} \in \mathbf{C}_P$; so (15.24) follows. In order to deduce (15.25), the probability on the
left side of (15.25) is equal to

$$P\{R_n(X^n, \theta(P)) \le J_n^{-1}(1 - \alpha, \hat P_n)\} . \tag{15.26}$$

Under $P$, $R_n(X^n, \theta(P))$ has a limiting distribution $J(\cdot, P)$ and, by (15.24),
$J_n^{-1}(1 - \alpha, \hat P_n) \to J^{-1}(1 - \alpha, P)$. Thus, by Slutsky's Theorem, (15.26) tends
to $J(J^{-1}(1 - \alpha, P), P) = 1 - \alpha$. ■

Often, the set of sequences $\mathbf{C}_P$ can be described as the set of sequences $\{P_n\}$
such that $d(P_n, P) \to 0$, where $d$ is an appropriate metric on the space of
probabilities. Indeed, one should think of $\mathbf{C}_P$ as a set of sequences $\{P_n\}$ that are
converging to $P$ in an appropriate sense. Thus, the convergence of $J_n(P_n)$ to
$J(P)$ is locally uniform in the sense that $d(P_n, P) \to 0$ implies $J_n(P_n)$ converges
weakly to $J(P)$. Note, however, that the appropriate metric $d$ will depend on the
precise nature of the root.

When the convergences (15.23) and (15.24) hold with probability one, we say 
the bootstrap is strongly consistent. If these convergences hold in probability, we 
say the bootstrap is weakly consistent. In any case, (15.25) holds even if (15.23) 
and (15.24) only hold in probability; see Problem 15.16. 

Example 15.4.1 (Parametric Bootstrap) Suppose $X^n = (X_1, \ldots, X_n)$ is a
sample from a q.m.d. model $\{P_\theta, \theta \in \Omega\}$, where $\Omega \subset \mathbb{R}^k$. Suppose $\hat\theta_n$ is an
efficient likelihood estimator in the sense that (12.62) holds. Suppose $g(\theta)$ is a
differentiable map from $\Omega$ to $\mathbb{R}$ with nonzero gradient vector $\dot g(\theta)$. Consider
the root $R_n(X^n, \theta) = n^{1/2}[g(\hat\theta_n) - g(\theta)]$, with distribution function $J_n(x, \theta)$. By
Theorem 12.4.1, $J_n(x, \theta) \to J(x, \theta)$, where $J(x, \theta) = \Phi(x/\sigma_\theta)$ and

$$\sigma_\theta^2 = \dot g(\theta) I^{-1}(\theta) \dot g(\theta)^T .$$

One approach to estimating the distribution of $n^{1/2}[g(\hat\theta_n) - g(\theta)]$ is to use the
normal approximation $N(0, \hat\sigma_n^2)$, where $\hat\sigma_n^2$ is a consistent estimator of $\sigma_\theta^2$. For
example, if $\dot g(\theta)$ and $I(\theta)$ are continuous in $\theta$, then a weakly consistent estimator
of $\sigma_\theta^2$ is

$$\hat\sigma_n^2 = \dot g(\hat\theta_n) I^{-1}(\hat\theta_n) \dot g(\hat\theta_n)^T .$$

In order to calculate $\hat\sigma_n^2$, the forms of $\dot g(\cdot)$ and $I(\cdot)$ must be known. This approach
of using a normal approximation with an estimator of the limiting variance is a
special case of asymptotic approximation discussed in Subsection 15.3.3. Because
it may be difficult to calculate a consistent estimator of the limiting variance, and
because the resulting approximation may be poor, it is interesting to consider
the bootstrap method. Higher order asymptotic comparisons will
be discussed in Section 15.5. For now, we show the bootstrap approximation
$J_n(x, \hat\theta_n)$ to $J_n(x, \theta)$ is weakly consistent.

Theorem 15.4.2 Under the above setup, under $\theta$,

$$\sup_x |J_n(x, \theta) - J(x, \theta)| \to 0$$

and

$$\sup_x |J_n(x, \hat\theta_n) - J_n(x, \theta)| \to 0 \tag{15.27}$$

in probability; therefore, (15.25) holds.

Proof. By Theorem 12.4.1, for any sequence $\theta_n$ such that $n^{1/2}(\theta_n - \theta) \to h$,
$J_n(x, \theta_n) \to J(x, \theta)$. In trying to apply the previous theorem, define $\mathbf{C}_\theta$ as the
set of sequences $\{\theta_n\}$ satisfying $n^{1/2}(\theta_n - \theta) \to h$, for some finite $h$. (Rather than
describe $\mathbf{C}_P$ as a set of sequences of distributions, we identify $P_\theta$ with $\theta$ and
describe $\mathbf{C}_\theta$ as a set of sequences of parameter values.) Unfortunately, $\hat\theta_n$ does not
fall in $\mathbf{C}_\theta$ with probability one because $n^{1/2}(\hat\theta_n - \theta)$ need not converge with
probability one. However, we can modify the argument as follows. Since $n^{1/2}(\hat\theta_n - \theta)$
converges in distribution, we can apply the Almost Sure Representation Theorem
(Theorem 11.2.19). Thus, there exist random variables $\tilde\theta_n$ and $H$ defined on a
common probability space such that $\hat\theta_n$ and $\tilde\theta_n$ have the same distribution and
$n^{1/2}(\tilde\theta_n - \theta) \to H$ almost surely. Then, $\{\tilde\theta_n\} \in \mathbf{C}_\theta$ with probability one, and we
can conclude

$$\sup_x |J_n(x, \tilde\theta_n) - J_n(x, \theta)| \to 0$$

almost surely. Since $\hat\theta_n$ and $\tilde\theta_n$ have the same distributional properties, so do
$J_n(\hat\theta_n)$ and $J_n(\tilde\theta_n)$, and the result (15.27) follows. ■

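A one-sided bootstrap lower confidence bound for $g(\theta)$ takes the form

$$g(\hat\theta_n) - n^{-1/2} J_n^{-1}(1 - \alpha, \hat\theta_n) .$$

The previous theorem implies, under $\theta$,

$$J_n^{-1}(1 - \alpha, \hat\theta_n) \xrightarrow{P} \sigma_\theta z_{1-\alpha} .$$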
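
As a concrete and deliberately simple illustration, consider the exponential model with
mean $\theta$, for which the sample mean is the maximum likelihood estimator, and take $g$ to
be the identity. The model, the function names, and the Monte Carlo settings in the
following Python sketch are assumptions chosen for the example. The code draws bootstrap
samples from the fitted model $P_{\hat\theta_n}$, approximates
$J_n^{-1}(1 - \alpha, \hat\theta_n)$ by simulation, and forms the lower confidence bound.

```python
import math
import random
import statistics


def parametric_bootstrap_quantile(theta_hat, n, level, n_boot=4000, seed=0):
    """Monte Carlo estimate of J_n^{-1}(level, theta_hat) for the root n^{1/2}(mean - theta)."""
    rng = random.Random(seed)
    roots = []
    for _ in range(n_boot):
        # n observations from the fitted model: exponential with mean theta_hat.
        resample = [rng.expovariate(1.0 / theta_hat) for _ in range(n)]
        roots.append(math.sqrt(n) * (statistics.fmean(resample) - theta_hat))
    roots.sort()
    return roots[max(0, math.ceil(level * n_boot) - 1)]


def lower_confidence_bound(sample, alpha=0.05, n_boot=4000, seed=0):
    """g(theta_hat) - n^{-1/2} J_n^{-1}(1 - alpha, theta_hat), with g the identity."""
    n = len(sample)
    theta_hat = statistics.fmean(sample)  # maximum likelihood estimate of the mean
    q = parametric_bootstrap_quantile(theta_hat, n, 1 - alpha, n_boot, seed)
    return theta_hat - q / math.sqrt(n)
```

In this model $\sigma_\theta = \theta$, so for large $n$ the simulated quantile should be close to
$\hat\theta_n z_{1-\alpha}$; in small samples it is somewhat larger because the distribution of
the sample mean is skewed to the right.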

Suppose now the problem is to test $g(\theta) = 0$ versus $g(\theta) > 0$. By the duality
between tests and confidence regions, one possibility is to reject the null
hypothesis if the lower confidence bound exceeds zero, or equivalently when
$n^{1/2} g(\hat\theta_n) > J_n^{-1}(1 - \alpha, \hat\theta_n)$. This test is pointwise asymptotically level $\alpha$
because, by Slutsky's Theorem, $n^{1/2} g(\hat\theta_n)$ is asymptotically $N(0, \sigma_\theta^2)$ if $g(\theta) = 0$.
The limiting power of this test against a contiguous sequence of alternatives is
given in the following corollary.

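Corollary 15.4.1 Under the setup of Example 15.4.1 with $\theta$ satisfying $g(\theta) = 0$,
the limiting power of the test that rejects when $n^{1/2} g(\hat\theta_n) > J_n^{-1}(1 - \alpha, \hat\theta_n)$ against
the sequence $\theta_n = \theta + hn^{-1/2}$ satisfies

$$P_{\theta_n}^n\{n^{1/2} g(\hat\theta_n) > J_n^{-1}(1 - \alpha, \hat\theta_n)\} \to 1 - \Phi(z_{1-\alpha} - \sigma_\theta^{-1} \langle \dot g(\theta)^T, h \rangle) . \tag{15.28}$$

Proof. The left hand side can be written as

$$P_{\theta_n}^n\{n^{1/2}[g(\hat\theta_n) - g(\theta_n)] > J_n^{-1}(1 - \alpha, \hat\theta_n) - n^{1/2} g(\theta_n)\} . \tag{15.29}$$

Under $P_\theta^n$, $J_n^{-1}(1 - \alpha, \hat\theta_n)$ converges in probability to $\sigma_\theta z_{1-\alpha}$; by contiguity, under
$P_{\theta_n}^n$, $J_n^{-1}(1 - \alpha, \hat\theta_n)$ converges to the same constant. Also, by differentiability of
$g$ and the fact that $g(\theta) = 0$,

$$n^{1/2} g(\theta_n) \to \langle \dot g(\theta)^T, h \rangle .$$

By Theorem 12.4.1, the left hand side of the inequality in (15.29) is asymptotically
$N(0, \sigma_\theta^2)$ under $\theta_n$. Letting $Z$ denote a standard normal variable, by Slutsky's theorem,
(15.29) converges to

$$P\{\sigma_\theta Z > \sigma_\theta z_{1-\alpha} - \langle \dot g(\theta)^T, h \rangle\} ,$$

and the result follows. ■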

In fact, it follows from Theorem 13.5.1 that this limiting power is optimal. The
moral is that the bootstrap can produce an asymptotically optimal test, but only
if the initial estimator or test statistic is optimally chosen. Otherwise, if the root
is based on a suboptimal estimator, the bootstrap approach to approximating the
sampling distribution of a root is so good that the bootstrap will not be optimal.
For example, in a normal location model $N(\theta, 1)$, the bootstrap distribution based
on the root $\bar X_n - \theta$ is exact as previously discussed (except possibly for simulation
error), as is the bootstrap distribution for $T_n - \theta$, where $T_n$ is any location
equivariant estimator. But, taking $T_n$ equal to the sample median would not lead
to an AUMP test, since the bootstrap is approximating the distribution of the
sample median, a suboptimal statistic in this case. Furthermore, this leads to the
observation that the bootstrap can be used adaptively to approximate several
distributions, and then inference can be based on the one with better properties;
see Léger and Romano (1990a,b).


15.4.2 The Nonparametric Mean

In this section, we consider the case of Example 15.3.4, confidence intervals for
the nonparametric mean. This example deserves special attention because many
statistics can be approximated by linear statistics. We will examine this case in
detail, since similar considerations apply to more complicated situations. Given
a sample $X^n = (X_1, \ldots, X_n)$ from a distribution $F$ on the real line, consider
the problem of constructing a confidence interval for $\theta(F) = E_F(X_i)$. Let $\sigma^2(F)$
denote the variance of $F$. The conditions for Theorem 15.4.1 are verified in the
following result.


Theorem 15.4.3 Let $F$ be a distribution on the line with finite, nonzero variance
$\sigma^2(F)$. Let $J_n(F)$ be the distribution of the root $R_n(X^n, \theta(F)) = n^{1/2}[\bar X_n - \theta(F)]$.

(i) Let $\mathbf{C}_F$ be the set of sequences $\{F_n\}$ such that $F_n$ converges weakly to $F$,
$\theta(F_n) \to \theta(F)$, and $\sigma^2(F_n) \to \sigma^2(F)$. If $\{F_n\} \in \mathbf{C}_F$, then $J_n(F_n)$ converges
weakly to $J(F)$, where $J(F)$ is the normal distribution with mean zero and
variance $\sigma^2(F)$.

(ii) Let $X_1, \ldots, X_n$ be i.i.d. $F$, and let $\hat F_n$ denote the empirical distribution
function. Then, the bootstrap estimator $J_n(\hat F_n)$ is strongly consistent so
that (15.23), (15.24), and (15.25) hold.

Proof of Theorem 15.4.3. For the purpose of proving (i), construct variables
$X_{n,1}, \ldots, X_{n,n}$ which are independent with identical distribution $F_n$, and set
$\bar X_n = \sum_i X_{n,i}/n$. We must show that the law of $n^{1/2}(\bar X_n - \mu(F_n))$ converges
weakly to $J(F)$. It suffices to verify the Lindeberg Condition for $Y_{n,i}$, where
$Y_{n,i} = X_{n,i} - \mu(F_n)$. This entails showing that, for each $\epsilon > 0$,

$$\lim_{n \to \infty} E[Y_{n,1}^2 1(Y_{n,1}^2 > n\epsilon^2)] = 0 . \tag{15.30}$$

Note that $Y_{n,1} \xrightarrow{d} Y$, where $Y = X - \mu(F)$ and $X$ has distribution $F$, and
$E(Y_{n,1}^2) \to E(Y^2)$. By the continuous mapping theorem (Theorem 11.2.13),
$Y_{n,1}^2 \xrightarrow{d} Y^2$. Now, for any fixed $\beta > 0$ and all $n > \beta/\epsilon^2$,

$$E[Y_{n,1}^2 1(Y_{n,1}^2 > n\epsilon^2)] \le E[Y_{n,1}^2 1(Y_{n,1}^2 > \beta)] \to E[Y^2 1(Y^2 > \beta)] ,$$

where the last convergence holds if $\beta$ is a continuity point of the distribution of
$Y^2$, by (11.40). Since the set of continuity points of any distribution is dense and
$E[Y^2 1(Y^2 > \beta)] \downarrow 0$ as $\beta \to \infty$, Lindeberg's Condition holds.

We now prove (ii) by applying Theorem 15.4.1; we must show that $\{\hat F_n\} \in \mathbf{C}_F$
with probability one. By the Glivenko-Cantelli theorem,

$$\sup_x |\hat F_n(x) - F(x)| \to 0 \quad \text{with probability one} .$$

Also, by the Strong Law of Large Numbers, $\theta(\hat F_n) \to \theta(F)$ with probability
one and $\sigma^2(\hat F_n) \to \sigma^2(F)$ with probability one. Thus, bootstrap confidence
intervals for the mean based on the root $R_n(X^n, \theta(F)) = n^{1/2}(\bar X_n - \theta(F))$ are
asymptotically consistent in the sense of the theorem. ■


Remark 15.4.1 Let $F$ and $G$ be two distribution functions on the real line and
define $d_p(F, G)$ to be the infimum of $\{E[|X - Y|^p]\}^{1/p}$ over all pairs of random
variables $X$ and $Y$ such that $X$ has distribution $F$ and $Y$ has distribution $G$. It
can be shown that the infimum is attained and that $d_p$ is a metric on the space of
distributions having a $p$th moment. Further, if $F$ has a finite variance $\sigma^2(F)$, then
$d_2(F_n, F) \to 0$ is equivalent to $F_n$ converging weakly to $F$ and $\sigma^2(F_n) \to \sigma^2(F)$.
Hence, Theorem 15.4.3 may be restated as follows. If $F$ has a finite variance
$\sigma^2(F)$ and $d_2(F_n, F) \to 0$, then $J_n(F_n)$ converges weakly to $J(F)$. The metric $d_2$
is known as the Mallows metric. For details, see Bickel and Freedman (1981).
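
For two empirical distributions on the real line based on samples of the same size, the
infimum defining $d_p$ (for $p \ge 1$) is attained by pairing the order statistics, so the
distance is easy to compute. The Python sketch below relies on that standard fact about
the monotone coupling; the function name `mallows_distance` and the restriction to equal
sample sizes are assumptions made to keep the example short.

```python
def mallows_distance(xs, ys, p=2):
    """d_p between the empirical distributions of two equal-length real samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    # Monotone (quantile) coupling: pair the i-th smallest x with the i-th smallest y.
    paired = zip(sorted(xs), sorted(ys))
    return (sum(abs(x - y) ** p for x, y in paired) / len(xs)) ** (1 / p)
```

For instance, shifting every observation by a constant $c$ moves the empirical distribution
a distance $|c|$ in this metric, and reordering a sample leaves the distance unchanged.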

Continuing the example of the nonparametric mean, it is of interest to consider
roots other than $n^{1/2}(\bar X_n - \theta(F))$. Specifically, consider the studentized root

$$R_n(X^n, \theta(F)) = n^{1/2}(\bar X_n - \theta(F))/\sigma(\hat F_n) , \tag{15.31}$$

where $\sigma^2(\hat F_n)$ is the usual bootstrap estimate of variance. To obtain consistency
of the bootstrap method, called the bootstrap-$t$, we appeal to the following result.

Theorem 15.4.4 Suppose $F$ is a c.d.f. with finite nonzero variance $\sigma^2(F)$. Let
$K_n(F)$ be the distribution of the root (15.31) based on a sample of size $n$ from $F$.

(i) Let $\mathbf{C}_F$ be defined as in Theorem 15.4.3. Then, for any sequence $\{F_n\} \in \mathbf{C}_F$,
$K_n(F_n)$ converges weakly to the standard normal distribution.

(ii) Hence, the bootstrap sampling distribution $K_n(\hat F_n)$ is consistent in the sense
that equations (15.23), (15.24), and (15.25) hold.

Before proving this theorem, we first need a weak law of large numbers for a 
triangular array that generalizes Theorem 11.2.10. The following lemma serves 
as a suitable version for our purposes. 

Lemma 15.4.1 Suppose $Y_{n,1}, \ldots, Y_{n,n}$ is a triangular array of independent random
variables, the $n$-th row having c.d.f. $G_n$. Assume $G_n$ converges in distribution
to $G$ and

$$E[|Y_{n,1}|] \to E[|Y|] < \infty$$

as $n \to \infty$, where $Y$ has c.d.f. $G$. Then,

$$\bar Y_n \equiv n^{-1} \sum_{i=1}^{n} Y_{n,i} \xrightarrow{P} E(Y)$$

as $n \to \infty$.

Proof. Apply Lemma 11.4.2 and (11.40). ■

Proof of Theorem 15.4.4. For the proof, let $X_{n,1}, \ldots, X_{n,n}$ be independent
with distribution $F_n$. By Theorem 15.4.3 and Slutsky's Theorem, it is enough to
show $\sigma^2(\hat F_n) \to \sigma^2(F)$ in probability under $F_n$. But,

$$\sigma^2(\hat F_n) = \frac{1}{n} \sum_i (X_{n,i} - \bar X_n)^2 = \frac{1}{n} \sum_i X_{n,i}^2 - \bar X_n^2 .$$

Now, apply Lemma 15.4.1 on the Weak Law of Large Numbers for a triangular
array with $Y_{n,i} = X_{n,i}$ and also with $Y_{n,i} = X_{n,i}^2$. The consistency of the
bootstrap method based on the root (15.31) now follows easily. ■
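
A Monte Carlo version of the bootstrap-$t$ interval based on the root (15.31) can be
sketched as follows. This is an illustrative Python sketch under stated assumptions: the
function names, the default number of resamples, and the decision to skip degenerate
resamples (those with zero variance) are choices made here, not prescriptions from the
text. The standard deviation $\sigma(\hat F_n)$ is computed with divisor $n$, matching the
variance of the empirical distribution.

```python
import math
import random
import statistics


def studentized_root(sample, theta):
    """n^{1/2}(mean - theta) / sigma(F_n), sigma^2(F_n) being the variance of the empirical c.d.f."""
    n = len(sample)
    return math.sqrt(n) * (statistics.fmean(sample) - theta) / statistics.pstdev(sample)


def bootstrap_t_interval(sample, alpha=0.05, n_boot=2000, seed=0):
    """Equal-tailed interval obtained by inverting the studentized root."""
    rng = random.Random(seed)
    n = len(sample)
    centre = statistics.fmean(sample)
    roots = []
    while len(roots) < n_boot:
        resample = rng.choices(sample, k=n)
        if min(resample) == max(resample):
            continue  # zero resample variance: the root is undefined, so skip (a choice made here)
        # In the bootstrap world the "true" parameter is the mean of the empirical distribution.
        roots.append(studentized_root(resample, centre))
    roots.sort()
    lo_q = roots[max(0, math.ceil(alpha / 2 * n_boot) - 1)]
    hi_q = roots[max(0, math.ceil((1 - alpha / 2) * n_boot) - 1)]
    scale = statistics.pstdev(sample) / math.sqrt(n)
    return centre - hi_q * scale, centre - lo_q * scale
```

For skewed data the resulting interval is typically not symmetric about $\bar X_n$, in
contrast to the interval obtained from the normal approximation.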

It is interesting to consider how the bootstrap behaves when the underlying
distribution has an infinite variance (but well-defined mean). The short answer
is that the bootstrap procedure considered thus far will fail, in the sense that the
convergence in expression (15.23) does not hold. The failure of the bootstrap for
the mean in the infinite variance case was first noted by Babu (1984); further
elucidation is given in Athreya (1987) and Knight (1989). In fact, a striking
theorem due to Giné and Zinn (1989) asserts that the simple bootstrap studied
thus far will work for the mean in the sense of strong consistency if and only if
the variance is finite. For a nice exposition of related results, see Giné (1997).

Related results for the studentized bootstrap based on approximating the
distribution of the root (15.31) were considered by Csörgő and Mason (1989) and
Hall (1990). The conclusion is that the bootstrap is strongly or almost surely
consistent if and only if the variance is finite; the bootstrap is weakly consistent
if and only if $X_i$ is in the domain of attraction of the normal distribution.

In fact, it was realized by Athreya (1985) that the bootstrap can be modified
so that consistency ensues even with infinite variance. The modification consists
of reducing the bootstrap sample size. Further results are given in Arcones and
Giné (1989, 1991). In other instances where the simple bootstrap fails,
consistency can often be recovered by reducing the bootstrap sample size. The
benefit of reducing the bootstrap sample size was recognized first in Bretagnolle
(1983). An even more general approach based on subsampling will be considered
later in Section 15.7.


15.4.3 Further Examples

Example 15.4.2 (Multivariate Mean) Let $X^n = (X_1, \ldots, X_n)$ be a sample
of $n$ observations from $F$, where $X_i$ takes values in $\mathbb{R}^k$. Let
$\theta(F) = E_F(X_i)$ be equal to the mean vector, and let

$$S_n(X^n, \theta(F)) = n^{1/2}(\bar X_n - \theta(F)) , \tag{15.32}$$

where $\bar X_n = \sum_i X_i/n$ is the sample mean vector. Let

$$R_n(X^n, \theta(F)) = \|S_n(X^n, \theta(F))\| ,$$

where $\|\cdot\|$ is any norm on $\mathbb{R}^k$. The consistency of the bootstrap method based
on the root $R_n$ follows from the following theorem.

Theorem 15.4.5 Let $L_n(F)$ be the distribution (in $\mathbb{R}^k$) of $S_n(X^n, \theta(F))$ under
$F$, where $S_n$ is defined in (15.32). Let $\Sigma(F)$ be the covariance matrix of $S_n$
under $F$. Let $C_F$ be the set of sequences $\{F_n\}$ such that $F_n$ converges weakly to
$F$ and $\Sigma(F_n) \to \Sigma(F)$, so that each entry of the matrix $\Sigma(F_n)$ converges to the
corresponding entry (assumed finite) of $\Sigma(F)$.

(i) Then, $L_n(F_n)$ converges weakly to $L(F)$, the multivariate normal
distribution with mean zero and covariance matrix $\Sigma(F)$.

(ii) Assume $\Sigma(F)$ contains at least one nonzero component. Let $\|\cdot\|$ be
any norm on $\mathbb{R}^k$ and let $J_n(F)$ be the distribution of
$R_n(X^n, \theta(F)) = \|S_n(X^n, \theta(F))\|$ under $F$. Then, $J_n(F_n)$ converges weakly
to $J(F)$, which is the distribution of $\|Z\|$ when $Z$ has distribution $L(F)$.

(iii) Suppose $X_1, \ldots, X_n$ are i.i.d. $F$ with empirical distribution $\hat F_n$ (in $\mathbb{R}^k$).
Then, the bootstrap approximation satisfies

$$\rho(J_n(F), J_n(\hat F_n)) \to 0 \text{ with probability one} ,$$

and bootstrap confidence regions based on the root $R_n$ are consistent in the
sense that the convergences (15.23) to (15.25) hold.

Proof. The proof of (i) follows by the Cramér-Wold device (Theorem 11.2.3)
and by Theorem 15.4.3 (i). To prove (ii), note that any norm $\|\cdot\|$ on $\mathbb{R}^k$ is
continuous almost everywhere with respect to $L(F)$. A proof of this statement
can be based on the fact that, for any norm $\|\cdot\|$, the set $\{x \in \mathbb{R}^k : \|x\| = c\}$
has Lebesgue measure zero because it is the boundary of a convex set. So, the
continuous mapping theorem applies and so $J_n(F_n)$ converges weakly to $J(F)$.

Part (iii) follows because $\{\hat F_n\} \in C_F$ with probability one, by the
Glivenko-Cantelli theorem (on $\mathbb{R}^k$) and the strong law of large numbers. ■

Note the power of the bootstrap method. Analytical methods for approximating
the distribution of the root $R_n = \|S_n\|$ would depend heavily on the choice
of norm $\|\cdot\|$, but the bootstrap handles them all with equal ease.

Let $\hat\Sigma_n = \Sigma(\hat F_n)$ be the sample covariance matrix. As in the univariate case,
one can also bootstrap the root defined by

$$R_n(X^n, \theta(F)) = \|\hat\Sigma_n^{-1/2}(\bar X_n - \theta(F))\| , \tag{15.33}$$

provided $\Sigma(F)$ is assumed positive definite. In the case where $\|\cdot\|$ is the usual
Euclidean norm, this root leads to a confidence ellipsoid, i.e., a confidence set whose
shape is an ellipsoid.
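
To make the construction of Example 15.4.2 concrete, the following is a minimal
sketch of the bootstrap approximation to a quantile of the root
$R_n = \|n^{1/2}(\bar X_n - \theta(F))\|$ for a user-supplied norm. The function names
(`bootstrap_root_quantile`, `in_region`, and the two norms), the number of
resamples and the seed are illustrative assumptions, not part of the text.

```python
import math
import random


def sample_mean(rows):
    """Componentwise mean of a list of equal-length vectors."""
    n, k = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(k)]


def sup_norm(v):
    return max(abs(c) for c in v)


def euclid_norm(v):
    return math.sqrt(sum(c * c for c in v))


def bootstrap_root_quantile(rows, norm, level=0.95, n_boot=2000, seed=0):
    """Bootstrap estimate of the `level` quantile of ||n^{1/2}(Xbar_n - theta(F))||.

    F is replaced by the empirical distribution, so theta(F) becomes the
    observed sample mean and Xbar_n becomes the mean of a resample.
    """
    rng = random.Random(seed)
    n = len(rows)
    center = sample_mean(rows)
    roots = []
    for _ in range(n_boot):
        star = [rows[rng.randrange(n)] for _ in range(n)]  # n draws with replacement
        mean_star = sample_mean(star)
        roots.append(norm([math.sqrt(n) * (a - b) for a, b in zip(mean_star, center)]))
    roots.sort()
    # smallest simulated value whose empirical c.d.f. is at least `level`
    idx = min(n_boot - 1, max(0, math.ceil(level * n_boot) - 1))
    return roots[idx]


def in_region(rows, theta, norm, quantile):
    """True if theta lies in {theta : ||n^{1/2}(Xbar_n - theta)|| <= quantile}."""
    n = len(rows)
    center = sample_mean(rows)
    return norm([math.sqrt(n) * (a - b) for a, b in zip(center, theta)]) <= quantile
```

Only the `norm` argument changes when a different norm is wanted, which is the
point made in the remark on the power of the bootstrap method.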


Example 15.4.3 (Smooth Functions of Means) Let $X_1, \ldots, X_n$ be i.i.d.
$S$-valued random variables with distribution $P$. Suppose
$\theta = \theta(P) = (\theta_1, \ldots, \theta_p)$, where $\theta_j = E_P[h_j(X_1)]$ and the $h_j$ are
real-valued functions defined on $S$. Interest focuses on $\theta$ or some function $f$ of
$\theta$. Let $\hat\theta_n = (\hat\theta_{n,1}, \ldots, \hat\theta_{n,p})$, where
$\hat\theta_{n,j} = \sum_{i=1}^n h_j(X_i)/n$. Assume moment conditions on the $h_j(X_i)$. Then, by
the multivariate mean case, the bootstrap approximation to the distribution of
$n^{1/2}(\hat\theta_n - \theta)$ is appropriately close in the sense

$$\rho\left(\mathcal{L}_P(n^{1/2}(\hat\theta_n - \theta)),\ \mathcal{L}_{P_n^*}(n^{1/2}(\hat\theta_n^* - \hat\theta_n))\right) \to 0 \tag{15.34}$$

with probability one, where $\rho$ is any metric metrizing weak convergence in $\mathbb{R}^p$
(such as the Bounded-Lipschitz metric introduced in Problem 11.23). Here, $P_n^*$
refers to the distribution of the data resampled from the empirical distribution
conditional on $X_1, \ldots, X_n$. Moreover,

$$\rho\left(\mathcal{L}_P(n^{1/2}(\hat\theta_n - \theta)),\ \mathcal{L}(Z)\right) \to 0 , \tag{15.35}$$

where $Z$ is multivariate normal with mean zero and covariance matrix $\Sigma$ having
$(i,j)$ component

$$\mathrm{Cov}(Z_i, Z_j) = \mathrm{Cov}[h_i(X_1), h_j(X_1)] .$$

To see why, define $Y_i$ to be the vector in $\mathbb{R}^p$ with $j$-th component $h_j(X_i)$,
so that we are exactly back in the multivariate mean case. Now, suppose $f$ is
an appropriately smooth function from $\mathbb{R}^p$ to $\mathbb{R}^q$, and interest now focuses
on the parameter $\mu = f(\theta)$. Assume $f = (f_1, \ldots, f_q)^T$, where $f_i(y_1, \ldots, y_p)$ is
a real-valued function from $\mathbb{R}^p$ having a nonzero differential at
$(y_1, \ldots, y_p) = (\theta_1, \ldots, \theta_p)$. Let $D$ be the $q \times p$ matrix with $(i,j)$ entry
$\partial f_i(y_1, \ldots, y_p)/\partial y_j$ evaluated at $(\theta_1, \ldots, \theta_p)$. Then, the following is true.

Theorem 15.4.6 Suppose $f$ is a function satisfying the above smoothness
assumptions. If $E[h_j^2(X_i)] < \infty$, then equations (15.34) and (15.35) hold.
Moreover,

$$\rho\left(\mathcal{L}_P(n^{1/2}[f(\hat\theta_n) - f(\theta)]),\ \mathcal{L}_{P_n^*}(n^{1/2}[f(\hat\theta_n^*) - f(\hat\theta_n)])\right) \to 0$$

with probability one and

$$\sup_s \left| P\{\|f(\hat\theta_n) - f(\theta)\| \le s\} - P_n^*\{\|f(\hat\theta_n^*) - f(\hat\theta_n)\| \le s\} \right| \to 0$$

with probability one.

Proof. The proof follows as equations (15.34) and (15.35) are immediate from
the multivariate mean case, and the smoothness assumptions on $f$ and the
Delta Method imply that $n^{1/2}[f(\hat\theta_n) - f(\theta)]$ has a limiting multivariate normal
distribution with mean 0 and covariance matrix $D\Sigma D^T$; see Theorem 11.2.14. ■
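
As an illustration of Theorem 15.4.6, the sketch below simulates the bootstrap
distribution of $n^{1/2}[f(\hat\theta_n^*) - f(\hat\theta_n)]$ for a scalar-valued $f$. The helper
name `smooth_function_bootstrap` and its arguments are hypothetical; the
variance example in the usage note (with $h_1(x) = x$, $h_2(x) = x^2$ and
$f(\theta_1, \theta_2) = \theta_2 - \theta_1^2$) is one choice of smooth function of means and is not
taken from the text.

```python
import math
import random


def smooth_function_bootstrap(sample, hs, f, n_boot=2000, seed=0):
    """Return f(theta_hat_n) and bootstrap draws of n^{1/2}[f(theta*_n) - f(theta_hat_n)].

    `hs` is a list of functions h_j, so theta_hat_n has components
    mean(h_j(X_i)); `f` maps that list of means to a real number.
    """
    rng = random.Random(seed)
    n = len(sample)

    def theta(xs):
        return [sum(h(x) for x in xs) / len(xs) for h in hs]

    f_hat = f(theta(sample))
    draws = []
    for _ in range(n_boot):
        star = [sample[rng.randrange(n)] for _ in range(n)]  # resample from F_n hat
        draws.append(math.sqrt(n) * (f(theta(star)) - f_hat))
    return f_hat, draws
```

For instance, `smooth_function_bootstrap(xs, [lambda x: x, lambda x: x * x],
lambda th: th[1] - th[0] ** 2)` treats the (biased) variance as a smooth
function of the first two sample moments.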


Example 15.4.4 (Joint Confidence Rectangles) Under the assumptions of
Theorem 15.4.6, a joint confidence set can be constructed for $(f_1(\theta), \ldots, f_q(\theta))$
with asymptotic coverage $1 - \alpha$. In the case where $\|x\| = \max_i |x_i|$, the set is a
rectangle in $\mathbb{R}^q$. Such a set is easily described as

$$\{f(\theta) : |f_i(\hat\theta_n) - f_i(\theta)| \le \hat b_n(1 - \alpha) \text{ for all } i\} ,$$

where $\hat b_n(1 - \alpha)$ is the bootstrap approximation to the $1 - \alpha$ quantile of the
distribution of $\max_i |f_i(\hat\theta_n) - f_i(\theta)|$. Thus, a value for $f_i(\theta)$ is included in the
region if and only if $f_i(\theta) \in f_i(\hat\theta_n) \pm \hat b_n(1 - \alpha)$. Note, however, the intervals
$f_i(\hat\theta_n) \pm \hat b_n(1 - \alpha)$ may be unbalanced in the sense that the limiting coverage
probability for each marginal parameter $f_i(\theta)$ may depend on $i$. To fix this, one
could instead bootstrap the distribution of $\max_i |f_i(\hat\theta_n) - f_i(\theta)|/\hat\sigma_{n,i}$, where
$\hat\sigma_{n,i}^2$ is some consistent estimate of the $(i,i)$ entry of the asymptotic covariance
matrix $D\Sigma D^T$ for $n^{1/2} f(\hat\theta_n)$. For further discussion, see Beran (1988a), who
employs a transformation called prepivoting to achieve balance.

Example 15.4.5 (Uniform Confidence Bands for a c.d.f. F) Consider a
sample $X^n = (X_1, \ldots, X_n)$ of real-valued observations having c.d.f. $F$. The
empirical c.d.f. $\hat F_n$ is then

$$\hat F_n(t) = n^{-1} \sum_{i=1}^n I\{X_i \le t\} .$$

For two distribution functions $F$ and $G$, define the Kolmogorov-Smirnov (or
uniform) metric

$$d_K(F, G) = \sup_t |F(t) - G(t)| .$$

Now, consider the root

$$R_n(X^n, \theta(F)) = n^{1/2} d_K(\hat F_n, F) ,$$

whose distribution under $F$ is denoted $J_n(F)$. As discussed in Example 11.2.12,
$J_n(F)$ has a continuous limiting distribution. In fact, the following triangular
array convergence holds. If $d_K(F_n, F) \to 0$, then $J_n(F_n) \xrightarrow{d} J(F)$; for a proof, see
Politis, Romano, and Wolf (1999, p.20). Thus, we can define $C_F$ to be the set
of sequences $\{F_n\}$ satisfying $d_K(F_n, F) \to 0$. By the Glivenko-Cantelli Theorem,
$d_K(\hat F_n, F) \to 0$ with probability one, and strong consistency of the bootstrap
follows. The resulting uniform confidence bands for $F$ are then consistent in the
sense that (15.25) holds, and no assumption on continuity of $F$ is needed (unlike
the classical limit theory). This example has been generalized considerably, and
the proof depends on the behavior of $n^{1/2}[\hat F_n(t) - F(t)]$, which can be viewed as a
random function and is called the empirical process. The general theory of
bootstrapping empirical processes is developed in van der Vaart and Wellner (1996)
and in Chapter 2 of Giné (1997). In particular, the theory generalizes to quite
general spaces $S$, so that the observations need not be real-valued. In the special
case when $S$ is $k$-dimensional Euclidean space, the $k$-dimensional empirical
process was considered in Beran and Millar (1986). Confidence sets for a multivariate
distribution based on the bootstrap can then be constructed which are pointwise
consistent in level. ■
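
The band of Example 15.4.5 can be sketched as follows. The bootstrap root is
$n^{1/2} d_K(\hat F_n^*, \hat F_n)$, and its $1 - \alpha$ quantile gives the half-width of the band
around $\hat F_n$. The function name `ks_bootstrap_band`, the clipping of the band to
$[0, 1]$, and the default number of resamples are assumptions made for
illustration only.

```python
import bisect
import math
import random


def ks_bootstrap_band(sample, level=0.95, n_boot=1000, seed=0):
    """Bootstrap uniform confidence band for a c.d.f.

    Returns (q, band), where q approximates the `level` quantile of
    n^{1/2} * sup_t |F_n_hat(t) - F(t)| and band is a list of
    (t, lower, upper) triples at the distinct observed values.
    """
    rng = random.Random(seed)
    xs = sorted(sample)
    n = len(xs)
    grid = sorted(set(xs))
    # empirical c.d.f. at each distinct data point: #{X_i <= t} / n
    f_hat = [bisect.bisect_right(xs, t) / n for t in grid]
    roots = []
    for _ in range(n_boot):
        star = sorted(xs[rng.randrange(n)] for _ in range(n))
        f_star = [bisect.bisect_right(star, t) / n for t in grid]
        # both step functions jump only at observed points, so the supremum
        # over t is attained on the grid
        d = max(abs(a - b) for a, b in zip(f_star, f_hat))
        roots.append(math.sqrt(n) * d)
    roots.sort()
    q = roots[min(n_boot - 1, max(0, math.ceil(level * n_boot) - 1))]
    half = q / math.sqrt(n)
    band = [(t, max(0.0, f - half), min(1.0, f + half)) for t, f in zip(grid, f_hat)]
    return q, band
```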


15.4.4 Stepdown Multiple Testing

Suppose data $X = X^n$ is generated from some unknown probability distribution
$P$, where $P$ belongs to a certain family of probability distributions $\Omega$. For
$j = 1, \ldots, s$, consider the problem of simultaneously testing hypotheses
$H_j : P \in \omega_j$.

For any subset $K \subset \{1, \ldots, s\}$, let $H_K = \bigcap_{j \in K} H_j$ be the hypothesis that
$P \in \bigcap_{j \in K} \omega_j$. Suppose that a test of the individual hypothesis $H_j$ is based on a
test statistic $T_{n,j}$, with large values indicating evidence against $H_j$.

The goal is to construct a stepdown method that controls the familywise error
rate (FWER). Recall that the FWER is the probability of rejecting at least one
true null hypothesis. More specifically, if $P$ is the true probability mechanism,
let $I = I(P) \subset \{1, \ldots, s\}$ denote the indices of the set of true hypotheses; that
is, $i \in I$ if and only if $P \in \omega_i$. Then, FWER is the probability under $P$ that any
$H_i$ with $i \in I$ is rejected. To show its dependence on $P$, we may write FWER
= FWER$_P$. We require that any procedure satisfy that the FWER be no bigger
than $\alpha$ (at least asymptotically).

Suppose $H_i$ is specified by a real-valued parameter $\beta_i(P) = 0$. Then, one approach
to constructing a multiple test is to invert a simultaneous confidence region.
Under the setup of Example 15.4.4, with $\beta_i(P) = f_i(\theta(P))$, any hypothesis $H_i$
is rejected if $|f_i(\hat\theta_n)| > \hat b_n(1 - \alpha)$. A procedure that uses a common critical value
$\hat b_n(1 - \alpha)$ for all the hypotheses is called a single-step method.

Another approach is to compute (or approximate) a $p$-value for each individual
test, and then use Holm's method discussed in Section 9.1. However, Holm's
method, which makes no assumptions about the dependence structure of the
test statistics, can be improved by methods that implicitly or explicitly estimate
this dependence structure. In this section, we consider a stepdown procedure
that incorporates the dependence structure and thereby improves upon the two
methods just described.

Let

$$T_{n,r_1} \ge T_{n,r_2} \ge \cdots \ge T_{n,r_s} \tag{15.36}$$

denote the observed ordered test statistics, and let $H_{r_1}, H_{r_2}, \ldots, H_{r_s}$ be the
corresponding hypotheses.

Recall the stepdown method presented in Procedure 9.1.1. The problem now
is how to construct the $\hat c_{n,K}(1 - \alpha)$ so that the FWER is controlled, at least
asymptotically. The following is an immediate consequence of Theorem 9.1.3, and
reduces the multiple testing problem of asymptotically controlling the FWER to
the single testing problem of asymptotically controlling the probability of a Type
1 error.


Corollary 15.4.2 Let $P$ denote the true distribution generating the data.
Consider Procedure 9.1.1 based on critical values $\hat c_{n,K}(1 - \alpha)$ which satisfy the
monotonicity requirement: for any $K \supset I(P)$,

$$\hat c_{n,K}(1 - \alpha) \ge \hat c_{n,I(P)}(1 - \alpha) . \tag{15.37}$$

If $\hat c_{n,I(P)}(1 - \alpha)$ satisfies

$$\limsup_n P\{\max(T_{n,j} : j \in I(P)) > \hat c_{n,I(P)}(1 - \alpha)\} \le \alpha , \tag{15.38}$$

then $\limsup_n \mathrm{FWER}_P \le \alpha$.


Under the monotonicity requirement (15.37), the multiplicity problem is
effectively reduced to testing a single intersection hypothesis at a time. So, the
problem now is to construct intersection tests whose critical values are monotone
and asymptotically control the rejection probability.

We now specialize a bit and develop a concrete construction based on the
bootstrap. Suppose hypothesis $H_i$ is specified by $\{P : \theta_i(P) = 0\}$ for some
real-valued parameter $\theta_i$, and $\hat\theta_{n,i}$ is an estimate of $\theta_i$. Also, let
$T_{n,i} = \tau_n |\hat\theta_{n,i}|$ for some nonnegative (nonrandom) sequence $\tau_n \to \infty$; usually,
$\tau_n = n^{1/2}$. The bootstrap method relies on its ability to approximate the joint
distribution of $\{\tau_n[\hat\theta_{n,i} - \theta_i(P)] : i \in K\}$, whose distribution we denote by
$J_{n,K}(P)$. Also, let $L_{n,K}(P)$ denote the distribution under $P$ of
$\max\{\tau_n|\hat\theta_{n,i} - \theta_i(P)| : i \in K\}$, with corresponding distribution function
$L_{n,K}(x, P)$ and $\alpha$-quantile

$$b_{n,K}(\alpha, P) = \inf\{x : L_{n,K}(x, P) \ge \alpha\} .$$

Let $\hat Q_n$ be some estimate of $P$. Then, a nominal $1 - \alpha$ level bootstrap confidence
region for the subset of parameters $\{\theta_i(P) : i \in K\}$ is given by

$$\{(\theta_i : i \in K) : \max_{i \in K} \tau_n|\hat\theta_{n,i} - \theta_i| \le b_{n,K}(1 - \alpha, \hat Q_n)\} .$$

So a value of 0 for $\theta_i(P)$ falls outside the region iff
$T_{n,i} = \tau_n|\hat\theta_{n,i}| > b_{n,K}(1 - \alpha, \hat Q_n)$. By the usual duality of confidence sets and
hypothesis tests, this suggests the use of the critical value

$$\hat c_{n,K}(1 - \alpha) = b_{n,K}(1 - \alpha, \hat Q_n) , \tag{15.39}$$

at least if the bootstrap is a valid asymptotic approach for confidence region
construction.

Note that, regardless of asymptotic behavior, the monotonicity assumption
(15.37) is always satisfied for the choice (15.39). Indeed, for any $Q$ and if $I \subset K$,
$b_{n,I}(1 - \alpha, Q)$ is the $1 - \alpha$ quantile under $Q$ of the maximum of $|I|$ variables,
while $b_{n,K}(1 - \alpha, Q)$ is the $1 - \alpha$ quantile of the maximum of these same $|I|$
variables together with $|K| - |I|$ additional variables.

Therefore, in order to apply Corollary 15.4.2 to conclude $\limsup_n \mathrm{FWER}_P \le \alpha$,
it is now only necessary to study the asymptotic behavior of $b_{n,K}(1 - \alpha, \hat Q_n)$
in the case $K = I(P)$. For this, we assume the usual conditions for bootstrap
consistency when testing the single hypothesis that $\theta_i(P) = 0$ for all $i \in I(P)$;
that is, we assume the bootstrap consistently estimates the joint distribution of
$\tau_n[\hat\theta_{n,i} - \theta_i(P)]$ for $i \in I(P)$. In particular, we assume

$$J_{n,I(P)}(P) \xrightarrow{d} J_{I(P)}(P) , \tag{15.40}$$

a nondegenerate limit law. Assumption (15.40) implies $L_{n,I(P)}(P)$ has a limiting
distribution $L_{I(P)}(P)$, with c.d.f. denoted $L_{I(P)}(x, P)$. We will further assume
$L_{I(P)}(P)$ is continuous and strictly increasing on its support. It follows that

$$b_{n,I(P)}(1 - \alpha, P) \to b_{I(P)}(1 - \alpha, P) , \tag{15.41}$$

where $b_{I(P)}(\alpha, P)$ is the $\alpha$-quantile of the limiting distribution $L_{I(P)}(P)$.

Theorem 15.4.7 Fix $P$ and assume (15.40) and that $L_{I(P)}(P)$ is continuous
and strictly increasing on its support. Let $\hat Q_n$ be an estimate of $P$ satisfying: for
any metric $\rho$ metrizing weak convergence on $\mathbb{R}^{|I(P)|}$,

$$\rho\left(J_{n,I(P)}(P),\ J_{n,I(P)}(\hat Q_n)\right) \xrightarrow{P} 0 . \tag{15.42}$$

Consider the generic stepdown method in Procedure 9.1.1 with $\hat c_{n,K}(1 - \alpha)$ equal
to $b_{n,K}(1 - \alpha, \hat Q_n)$. Then, $\limsup_n \mathrm{FWER}_P \le \alpha$.

Proof. By the Continuous Mapping Theorem and a subsequence argument
(Problem 15.28), the assumption (15.40) implies

$$\rho_1\left(L_{n,I(P)}(P),\ L_{n,I(P)}(\hat Q_n)\right) \xrightarrow{P} 0 , \tag{15.43}$$

where $\rho_1$ is any metric metrizing weak convergence on $\mathbb{R}$. It follows from
Lemma 11.2.1 (ii) that

$$b_{n,I(P)}(1 - \alpha, \hat Q_n) \xrightarrow{P} b_{I(P)}(1 - \alpha, P) .$$

By Slutsky's Theorem,

$$P\{\max(T_{n,j} : j \in I(P)) > b_{n,I(P)}(1 - \alpha, \hat Q_n)\} \to 1 - L_{I(P)}(b_{I(P)}(1 - \alpha, P), P) ,$$

and the last expression is $\alpha$. ■

Example 15.4.6 (Multivariate Mean) Assume $X_i = (X_{i,1}, \ldots, X_{i,s})$ are $n$
i.i.d. random vectors with $E(|X_i|^2) < \infty$ and mean vector $\mu = (\mu_1, \ldots, \mu_s)$.
Note that the vector $X_i$ can have an arbitrary $s$-variate distribution, so that
multivariate normality is not assumed as it was in Example 9.1.4. Suppose $H_i$
specifies $\mu_i = 0$ and $T_{n,i} = n^{-1/2}|\sum_{j=1}^n X_{j,i}|$. Then, the conditions of Theorem
15.4.7 are satisfied by Example 15.4.2. Alternatively, one can also consider the
studentized test statistic $t_{n,i} = T_{n,i}/S_{n,i}$, where $S_{n,i}^2$ is the sample variance of
the $i$th components of the data (Problem 15.29). ■
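
A minimal sketch of the bootstrap stepdown idea for Example 15.4.6 follows. It
uses $T_{n,i} = n^{1/2}|\bar X_{n,i}|$ and the critical values (15.39), with $\hat Q_n$ taken to be
the empirical distribution. The loop rejects one hypothesis at a time, starting
from the largest statistic; whether this matches Procedure 9.1.1 step for step
is an assumption, since that procedure is not reproduced in this section. The
function name `stepdown_bootstrap` and its defaults are hypothetical.

```python
import math
import random


def stepdown_bootstrap(rows, alpha=0.05, n_boot=1000, seed=0):
    """Stepdown test of H_i: mu_i = 0 using T_{n,i} = n^{1/2} |mean of component i|.

    The critical value for a set K of remaining hypotheses is the bootstrap
    (1 - alpha) quantile of max_{i in K} n^{1/2} |mean*_i - mean_i|.
    Returns the sorted list of rejected component indices.
    """
    rng = random.Random(seed)
    n, s = len(rows), len(rows[0])
    mean = [sum(r[i] for r in rows) / n for i in range(s)]
    stats = [math.sqrt(n) * abs(m) for m in mean]

    # One vector of centered bootstrap roots per resample. Reusing the same
    # resamples for every K makes the critical values monotone in K.
    boot = []
    for _ in range(n_boot):
        star = [rows[rng.randrange(n)] for _ in range(n)]
        boot.append([math.sqrt(n) * abs(sum(r[i] for r in star) / n - mean[i])
                     for i in range(s)])

    def crit(K):
        maxima = sorted(max(b[i] for i in K) for b in boot)
        return maxima[min(n_boot - 1, max(0, math.ceil((1 - alpha) * n_boot) - 1))]

    remaining = set(range(s))
    rejected = []
    while remaining:
        worst = max(remaining, key=lambda i: stats[i])
        if stats[worst] <= crit(remaining):
            break  # accept all remaining hypotheses and stop
        rejected.append(worst)
        remaining.remove(worst)
    return sorted(rejected)
```

Because the maximum over a smaller index set can only decrease, later steps use
smaller critical values than the single-step method, which is where the gain in
power over a common critical value comes from.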

Example 15.4.7 (Comparing Treatment Means) For $i = 1, \ldots, k$, suppose
we observe $k$ independent samples, and the $i$th sample consists of $n_i$ i.i.d.
observations $X_{i,1}, \ldots, X_{i,n_i}$ with mean $\mu_i$ and finite variance $\sigma_i^2$. Hypothesis
$H_{i,j}$ specifies $\mu_i = \mu_j$, so that the problem is to compare all $s = \binom{k}{2}$ means.
(Note that we are indexing hypotheses and test statistics now by 2 indices $i$ and $j$.)
Let $T_{n,i,j} = n^{1/2}|\bar X_{n,i} - \bar X_{n,j}|$, where $\bar X_{n,i} = \sum_{j=1}^{n_i} X_{i,j}/n_i$. Let $\hat Q_{n,i}$ be the
empirical distribution of the $i$th sample. The bootstrap resampling scheme is to
independently resample $n_i$ observations from $\hat Q_{n,i}$, $i = 1, \ldots, k$. Then, Theorem
15.4.7 applies and it also applies to appropriately studentized statistics (Problem
15.30). The setup can easily accommodate comparisons of $k$ treatments with a
control group (Problem 15.31). ■

Example 15.4.8 (Testing Correlations) Suppose $X_1, \ldots, X_n$ are i.i.d.
random vectors in $\mathbb{R}^k$, so that $X_i = (X_{i,1}, \ldots, X_{i,k})$. Assume $E|X_{i,j}|^2 < \infty$ and
$\mathrm{Var}(X_{i,j}) > 0$, so that the correlation between $X_{1,i}$ and $X_{1,j}$, namely $\rho_{i,j}$, is
well-defined. Let $H_{i,j}$ denote the hypothesis that $\rho_{i,j} = 0$, so that the multiple
testing problem consists in testing all $s = \binom{k}{2}$ pairwise correlations. Also let
$T_{n,i,j}$ denote the ordinary sample correlation between variables $i$ and $j$. (Note that we
are indexing hypotheses and test statistics now by 2 indices $i$ and $j$.) By
Example 15.4.3, the conditions for the bootstrap hold because correlations are smooth
functions of means. ■


15.5 Higher Order Asymptotic Comparisons 

One of the main reasons the bootstrap approach is so valuable is that it can be
applied to approximate the sampling distribution of an estimator in situations
where the finite or large sample distribution theory is intractable, or depends on
unknown parameters. However, even in relatively simple situations, we will see
that there are advantages to using a bootstrap approach. For example, consider
the problem of constructing a confidence interval for a mean. Under the
assumption of a finite variance, the standard normal theory interval and the
bootstrap-t are each pointwise consistent in level. In order to compare them, we must
consider higher order asymptotic properties. More generally, suppose $I_n$ is a nominal
$1 - \alpha$ level confidence interval for a parameter $\theta(P)$. Its coverage error under $P$
is

$$P\{\theta(P) \in I_n\} - (1 - \alpha) ,$$

and we would like to examine the rate at which this tends to zero. In typical
problems, this coverage error is a power of $n^{-1/2}$. It will be necessary to distinguish
one-sided and two-sided confidence intervals because their orders of coverage error
may differ.

Throughout this section, attention will focus on confidence intervals for the
mean in a nonparametric setting. Specifically, we would like to compare some
asymptotic methods based on the normal approximation and the bootstrap. Let
$X^n = (X_1, \ldots, X_n)$ be i.i.d. with c.d.f. $F$, mean $\theta(F)$, and variance $\sigma^2(F)$. Also,
let $\hat F_n$ denote the empirical c.d.f., and let $\hat\sigma_n = \sigma(\hat F_n)$.

Before addressing coverage error, we recall from Section 11.4.1 the Edgeworth
expansions for the distributions of the roots

$$R_n(X^n, F) = n^{1/2}(\bar X_n - \theta(F))$$

and

$$R_n^s(X^n, F) = n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n ;$$

as in Section 15.4.2, their distribution functions under $F$ are denoted $J_n(\cdot, F)$
and $K_n(\cdot, F)$, respectively. Let $\Phi$ and $\varphi$ denote the standard normal c.d.f. and
density, respectively.


Theorem 15.5.1 Assume $E_F(X_i^4) < \infty$. Let $\psi_F$ denote the characteristic
function of $F$, and assume

$$\limsup_{|s| \to \infty} |\psi_F(s)| < 1 . \tag{15.44}$$

Then,

$$J_n(t, F) = \Phi(t/\sigma(F)) - \frac{1}{6}\gamma(F)\varphi(t/\sigma(F))\left(\frac{t^2}{\sigma^2(F)} - 1\right)n^{-1/2} + O(n^{-1}) , \tag{15.45}$$

where

$$\gamma(F) = E_F[X_1 - \theta(F)]^3/\sigma^3(F)$$

is the skewness of $F$. Moreover, the expansion holds uniformly in $t$ in the sense
that

$$J_n(t, F) = \left[\Phi(t/\sigma(F)) - \frac{1}{6}\gamma(F)\varphi(t/\sigma(F))\left(\frac{t^2}{\sigma^2(F)} - 1\right)n^{-1/2}\right] + R_n(t, F) ,$$

where $|R_n(t, F)| \le C/n$ for all $t$ and some $C = C_F$ which depends on $F$.


Theorem 15.5.2 Assume $E_F(X_i^4) < \infty$ and that $F$ is absolutely continuous.
Then, uniformly in $t$,

$$K_n(t, F) = \Phi(t) + \frac{1}{6}\gamma(F)\varphi(t)(2t^2 + 1)n^{-1/2} + O(n^{-1}) . \tag{15.46}$$

Note that the term of order $n^{-1/2}$ is zero if and only if the underlying skewness
$\gamma(F)$ is zero, so that the dominant error in using a standard normal
approximation to the distribution of the studentized statistic is due to skewness of the
underlying distribution. We will use these expansions in order to derive some
important properties of confidence intervals. Note, however, that the expansions
are asymptotic results, and for finite $n$, including the correction term (i.e. the
term of order $n^{-1/2}$) may worsen the approximation.

Expansions for the distribution of a root such as (15.45) and (15.46) imply
corresponding expansions for their quantiles, which are known as Cornish-Fisher
Expansions. For example, $K_n^{-1}(1 - \alpha, F)$ is a value of $t$ satisfying
$K_n(t, F) = 1 - \alpha$. Of course, $K_n^{-1}(1 - \alpha, F) \to z_{1-\alpha}$. We would like to determine
$c = c(\alpha, F)$ such that

$$K_n^{-1}(1 - \alpha, F) = z_{1-\alpha} + cn^{-1/2} + O(n^{-1}) .$$

Set $1 - \alpha$ equal to the right hand side of (15.46) with $t = z_{1-\alpha} + cn^{-1/2}$, which
yields

$$\Phi(z_{1-\alpha} + cn^{-1/2}) + \frac{1}{6}\gamma(F)\varphi(z_{1-\alpha} + cn^{-1/2})(2z_{1-\alpha}^2 + 1)n^{-1/2} + O(n^{-1}) = 1 - \alpha .$$

By expanding $\Phi$ and $\varphi$ about $z_{1-\alpha}$, we find that

$$c = -\frac{1}{6}\gamma(F)(2z_{1-\alpha}^2 + 1) .$$

Thus,

$$K_n^{-1}(1 - \alpha, F) = z_{1-\alpha} - \frac{1}{6}\gamma(F)(2z_{1-\alpha}^2 + 1)n^{-1/2} + O(n^{-1}) . \tag{15.47}$$

In fact, under the assumptions of Theorem 15.5.2, the expansion (15.46) holds
uniformly in $t$, and so the expansion (15.47) holds uniformly in $\alpha \in [\epsilon, 1 - \epsilon]$, for
any $\epsilon > 0$ (Problem 15.34). Similarly, one can show (Problem 15.35) that, under
the assumptions of Theorem 15.5.1,

$$J_n^{-1}(1 - \alpha, F) = \sigma(F)z_{1-\alpha} + \frac{1}{6}\sigma(F)\gamma(F)(z_{1-\alpha}^2 - 1)n^{-1/2} + O(n^{-1}) , \tag{15.48}$$

uniformly in $\alpha \in [\epsilon, 1 - \epsilon]$.
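
The two-term approximations above are easy to evaluate numerically. The sketch
below codes the leading terms of (15.46), (15.47) and (15.48) with the $O(n^{-1})$
remainders dropped. The function names are hypothetical, and the skewness
value 2 in the usage note (the skewness of an exponential distribution) is only
an illustrative input.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def edgeworth_studentized_cdf(t, skew, n):
    """Two-term expansion (15.46) of K_n(t, F), without the O(1/n) remainder."""
    return (_STD_NORMAL.cdf(t)
            + skew * _STD_NORMAL.pdf(t) * (2 * t * t + 1) / (6 * n ** 0.5))


def cf_quantile_studentized(p, skew, n):
    """Two-term Cornish-Fisher approximation (15.47) to K_n^{-1}(p, F)."""
    z = _STD_NORMAL.inv_cdf(p)
    return z - skew * (2 * z * z + 1) / (6 * n ** 0.5)


def cf_quantile_unstudentized(p, skew, sigma, n):
    """Two-term Cornish-Fisher approximation (15.48) to J_n^{-1}(p, F)."""
    z = _STD_NORMAL.inv_cdf(p)
    return sigma * z + sigma * skew * (z * z - 1) / (6 * n ** 0.5)
```

As a consistency check, plugging `cf_quantile_studentized(0.95, 2.0, 400)` back
into `edgeworth_studentized_cdf` returns a value that differs from 0.95 only by
a term of order $1/n$, as the derivation of (15.47) predicts.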

Normal Theory Intervals. The most basic approximate upper one-sided confidence
interval for the mean $\theta(F)$ is given by

$$\bar X_n + n^{-1/2}\hat\sigma_n z_{1-\alpha} , \tag{15.49}$$

where $\hat\sigma_n^2 = \sigma^2(\hat F_n)$ is the (biased) sample variance. Its one-sided coverage error
is given by

$$P_F\{\theta(F) \le \bar X_n + n^{-1/2}\hat\sigma_n z_{1-\alpha}\} - (1 - \alpha)
= \alpha - P_F\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n < z_\alpha\} . \tag{15.50}$$

By (15.46), the one-sided coverage error of this normal theory interval is

$$-\frac{1}{6}\gamma(F)\varphi(z_\alpha)(2z_\alpha^2 + 1)n^{-1/2} + O(n^{-1}) = O(n^{-1/2}) . \tag{15.51}$$

Analogously, the coverage error of the two-sided confidence interval of nominal
level $1 - 2\alpha$,

$$\bar X_n \pm n^{-1/2}\hat\sigma_n z_{1-\alpha} , \tag{15.52}$$

satisfies

$$P_F\{-z_{1-\alpha} \le n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n \le z_{1-\alpha}\} - (1 - 2\alpha)$$

$$= P\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n \le z_{1-\alpha}\} - P\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n < -z_{1-\alpha}\} - (1 - 2\alpha) ,$$

which by (15.46) is equal to

$$\left[\Phi(z_{1-\alpha}) + \frac{1}{6}\gamma(F)\varphi(z_{1-\alpha})(2z_{1-\alpha}^2 + 1)n^{-1/2} + O(n^{-1})\right]$$

$$- \left[\Phi(-z_{1-\alpha}) + \frac{1}{6}\gamma(F)\varphi(-z_{1-\alpha})(2z_{1-\alpha}^2 + 1)n^{-1/2} + O(n^{-1})\right] - (1 - 2\alpha) = O(n^{-1}) ,$$

using the symmetry of the function $\varphi$. Thus, while the coverage error of the
one-sided interval (15.49) is $O(n^{-1/2})$, the two-sided interval (15.52) has
coverage error $O(n^{-1})$. The main reason the one-sided interval has coverage error
$O(n^{-1/2})$ derives from the fact that a normal approximation is used for the
distribution of $n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n$ and no correction is made for skewness of the
underlying distribution. For example, if $\gamma(F) > 0$, the one-sided upper
confidence bound (15.49) undercovers slightly while the one-sided lower confidence
bound overcovers. The combination of overcoverage and undercoverage yields a
net result of a reduction in the order of coverage error of two-sided intervals.
Analytically, this fact derives from the key property that the $n^{-1/2}$ term in (15.46) is
an even polynomial. (Note, however, that the one-sided coverage error is $O(n^{-1})$
if $\gamma(F) = 0$.) These results are in complete analogy with the corresponding
results in Section 11.4.1 for error in rejection probability of tests of the mean based
on the normal approximation.

Basic Bootstrap Intervals. Next, we consider bootstrap confidence intervals for
$\theta(F)$ based on the root

$$R_n(X^n, \theta(F)) = n^{1/2}(\bar X_n - \theta(F)) . \tag{15.53}$$

It is plausible that the bootstrap approximation $J_n(t, \hat F_n)$ to $J_n(t, F)$ satisfies an
expansion like (15.45) with $F$ replaced by $\hat F_n$. In fact, it is the case that

$$J_n(t, \hat F_n) = \Phi(t/\hat\sigma_n) - \frac{1}{6}\gamma(\hat F_n)\varphi(t/\hat\sigma_n)\left(\frac{t^2}{\hat\sigma_n^2} - 1\right)n^{-1/2} + O_P(n^{-1}) . \tag{15.54}$$

Both sides of (15.54) are random and the remainder term is now of order $n^{-1}$
in probability. Similarly, the bootstrap quantile function $J_n^{-1}(1 - \alpha, \hat F_n)$ has an
analogous expansion to (15.48) and is given by

$$J_n^{-1}(1 - \alpha, \hat F_n) = \hat\sigma_n\left[z_{1-\alpha} + \frac{1}{6}\gamma(\hat F_n)(z_{1-\alpha}^2 - 1)n^{-1/2}\right] + O_P(n^{-1}) . \tag{15.55}$$

The validity of these expansions is quite technical and is proved in Hall (1992,
Section 5.2), and a sufficient condition for them to hold is that $F$ satisfies Cramér's
condition and has infinitely many moments; such assumptions will remain in force
for the remainder of this section. From (15.45) and (15.54), it follows that

$$J_n(t, \hat F_n) - J_n(t, F) = O_P(n^{-1/2})$$

because

$$\hat\sigma_n - \sigma(F) = O_P(n^{-1/2}) .$$

Thus, the bootstrap approximation $J_n(t, \hat F_n)$ to $J_n(t, F)$ has the same order of
error as that provided by the normal approximation.

Turning now to coverage error, consider the one-sided coverage error of the
nominal level $1 - \alpha$ upper confidence bound $\bar X_n - n^{-1/2}J_n^{-1}(\alpha, \hat F_n)$, given by

$$P_F\{\theta(F) \le \bar X_n - n^{-1/2}J_n^{-1}(\alpha, \hat F_n)\} - (1 - \alpha)$$

$$= \alpha - P_F\{n^{1/2}(\bar X_n - \theta(F)) < J_n^{-1}(\alpha, \hat F_n)\}$$

$$= \alpha - P_F\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n < z_\alpha + \frac{1}{6}\gamma(F)(z_\alpha^2 - 1)n^{-1/2} + O_P(n^{-1})\}$$

$$= \alpha - P_F\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n < z_\alpha + \frac{1}{6}\gamma(F)(z_\alpha^2 - 1)n^{-1/2}\} + O(n^{-1}) .$$

The last equality, though plausible, requires a rigorous argument, but follows
from Problem 15.36. The last expression, by (15.46) and a Taylor expansion,
becomes

$$-\frac{1}{2}\gamma(F)\varphi(z_\alpha)z_\alpha^2 n^{-1/2} + O(n^{-1}) ,$$

so that the one-sided coverage error is of the same order as that provided by
the basic normal approximation. Moreover, by similar reasoning, the two-sided
bootstrap interval of nominal level $1 - 2\alpha$, given by

$$[\bar X_n - n^{-1/2}J_n^{-1}(1 - \alpha, \hat F_n),\ \bar X_n - n^{-1/2}J_n^{-1}(\alpha, \hat F_n)] , \tag{15.56}$$

has coverage error $O(n^{-1})$. Although these basic bootstrap intervals have the
same orders of coverage error as those based on the normal approximation, there
is evidence that the bootstrap does provide some improvement (in terms of the
size of the constants); see Liu and Singh (1987).
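
For reference, the interval (15.56) can be computed by simulation as in the
following sketch, where the quantiles $J_n^{-1}(\cdot, \hat F_n)$ are approximated by order
statistics of resampled roots $n^{1/2}(\bar X_n^* - \bar X_n)$. The function name
`basic_bootstrap_interval`, the number of resamples and the seed are
illustrative assumptions.

```python
import math
import random


def basic_bootstrap_interval(sample, alpha=0.05, n_boot=2000, seed=0):
    """Two-sided interval (15.56) for the mean, nominal level 1 - 2*alpha.

    J_n^{-1}(p, F_n hat) is approximated by the empirical p-quantile of
    n^{1/2}(mean of resample - sample mean) over n_boot resamples.
    """
    rng = random.Random(seed)
    n = len(sample)
    xbar = sum(sample) / n
    roots = []
    for _ in range(n_boot):
        star_mean = sum(sample[rng.randrange(n)] for _ in range(n)) / n
        roots.append(math.sqrt(n) * (star_mean - xbar))
    roots.sort()

    def quantile(p):
        return roots[min(n_boot - 1, max(0, math.ceil(p * n_boot) - 1))]

    lower = xbar - quantile(1 - alpha) / math.sqrt(n)
    upper = xbar - quantile(alpha) / math.sqrt(n)
    return lower, upper
```

Note that the upper quantile of the root determines the lower endpoint and vice
versa, exactly as in (15.56).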

Bootstrap-t Confidence Intervals. Next, we consider bootstrap confidence
intervals for $\theta(F)$ based on the studentized root

$$R_n^s(X^n, \theta(F)) = n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n , \tag{15.57}$$

whose distribution under $F$ is denoted $K_n(\cdot, F)$. The bootstrap versions of the
expansions (15.46) and (15.47) are

$$K_n(t, \hat F_n) = \Phi(t) + \frac{1}{6}\gamma(\hat F_n)\varphi(t)(2t^2 + 1)n^{-1/2} + O_P(n^{-1}) \tag{15.58}$$

and

$$K_n^{-1}(1 - \alpha, \hat F_n) = z_{1-\alpha} - \frac{1}{6}\gamma(\hat F_n)(2z_{1-\alpha}^2 + 1)n^{-1/2} + O_P(n^{-1}) . \tag{15.59}$$

Again, these results are obtained rigorously in Hall (1992), and a sufficient
condition for their validity is that $F$ is absolutely continuous with infinitely many
moments. By comparing (15.46) and (15.58), it follows that

$$K_n(t, \hat F_n) - K_n(t, F) = O_P(n^{-1}) , \tag{15.60}$$

since $\gamma(\hat F_n) - \gamma(F) = O_P(n^{-1/2})$. Similarly,

$$K_n^{-1}(1 - \alpha, \hat F_n) - K_n^{-1}(1 - \alpha, F) = O_P(n^{-1}) . \tag{15.61}$$

Thus, the bootstrap is more successful at estimating the distribution or quantiles
of the studentized root than its nonstudentized version.

Now, consider the nominal level $1 - \alpha$ upper confidence bound
$\bar X_n - n^{-1/2}\hat\sigma_n K_n^{-1}(\alpha, \hat F_n)$. Its coverage error is given by

$$P_F\{\theta(F) \le \bar X_n - n^{-1/2}\hat\sigma_n K_n^{-1}(\alpha, \hat F_n)\} - (1 - \alpha)$$

$$= \alpha - P_F\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n < K_n^{-1}(\alpha, \hat F_n)\}$$

$$= \alpha - P_F\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n < z_\alpha - \frac{1}{6}\gamma(F)(2z_\alpha^2 + 1)n^{-1/2} + O_P(n^{-1})\} ,$$

since (15.59) implies the same expansion for $K_n^{-1}(\alpha, \hat F_n)$ with $\gamma(\hat F_n)$ replaced by
$\gamma(F)$ (again using the fact that $\gamma(\hat F_n) - \gamma(F) = O_P(n^{-1/2})$). By Problem 15.36,
this last expression becomes

$$\alpha - P_F\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n < z_\alpha - \frac{1}{6}\gamma(F)(2z_\alpha^2 + 1)n^{-1/2}\} + O(n^{-1}) .$$

Let

$$t_n = t_n(\alpha, F) = z_\alpha - \frac{1}{6}\gamma(F)(2z_\alpha^2 + 1)n^{-1/2} ,$$

so that $(t_n - z_\alpha) = O(n^{-1/2})$. Then, the coverage error becomes

$$\alpha - \left[\Phi(t_n) + \frac{1}{6}\gamma(F)\varphi(t_n)(2t_n^2 + 1)n^{-1/2} + O(n^{-1})\right] .$$

By expanding $\Phi$ and $\varphi$ about $z_\alpha$ and combining terms that are $O(n^{-1})$, the last
expression becomes

$$\alpha - \Phi(z_\alpha) - (t_n - z_\alpha)\varphi(z_\alpha) + O(n^{-1})$$

$$- \frac{1}{6}\gamma(F)\left[\varphi(z_\alpha) + (t_n - z_\alpha)\varphi'(z_\alpha) + O(n^{-1})\right](2z_\alpha^2 + 1)n^{-1/2} + O(n^{-1}) = O(n^{-1}) .$$

Thus, the one-sided coverage error of the bootstrap-t interval is $O(n^{-1})$ and is
of smaller order than that provided by the normal approximation or the
bootstrap based on a nonstudentized root. Intervals with one-sided coverage error of
order $O(n^{-1})$ are said to be second-order accurate, while intervals with one-sided
coverage error of order $O(n^{-1/2})$ are only first-order accurate.

A heuristic reason why the bootstrap based on the root (15.57) outperforms
the bootstrap based on the root (15.53) is as follows. In the case of (15.53), the
bootstrap is estimating a distribution that has mean 0 and unknown variance
$\sigma^2(F)$. The main contribution to the estimation error is the implicit estimation
of $\sigma^2(F)$ by $\sigma^2(\hat F_n)$. On the other hand, the root (15.57) has a distribution that
is nearly independent of $F$ since it is an asymptotic pivot.

The two-sided interval of nominal level $1 - 2\alpha$,

$$[\bar X_n - n^{-1/2}\hat\sigma_n K_n^{-1}(1 - \alpha, \hat F_n),\ \bar X_n - n^{-1/2}\hat\sigma_n K_n^{-1}(\alpha, \hat F_n)] , \tag{15.62}$$

also has coverage error $O(n^{-1})$ (Problem 15.38). This interval was formed by
combining two one-sided intervals. Instead, consider the absolute studentized
root

$$R_n^t(X^n, \theta(F)) = |n^{1/2}(\bar X_n - \theta(F))|/\hat\sigma_n ,$$

whose distribution and quantile functions under $F$ are denoted $L_n(t, F)$ and
$L_n^{-1}(1 - \alpha, F)$, respectively. An alternative two-sided bootstrap confidence interval
for $\theta(F)$ of nominal level $1 - \alpha$ is given by

$$\bar X_n \pm n^{-1/2}\hat\sigma_n L_n^{-1}(1 - \alpha, \hat F_n) .$$

Note that this interval is symmetric about $\bar X_n$. Its coverage error is actually
$O(n^{-2})$. The arguments for this claim are similar to the previous claims about
coverage error, but more terms are required in expansions like (15.46).
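
Both two-sided bootstrap-t constructions can be simulated from the same set of
studentized bootstrap roots, as in the sketch below. The equal-tailed interval
follows (15.62) and has nominal level $1 - 2\alpha$, while the symmetric interval uses
the absolute studentized root and has nominal level $1 - \alpha$. The function name
`bootstrap_t_intervals` and the handling of degenerate resamples (those with
zero variance are skipped) are assumptions of this sketch.

```python
import math
import random


def _mean_sd(xs):
    """Mean and biased standard deviation sigma(F_n hat) of a list."""
    n = len(xs)
    m = sum(xs) / n
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / n)


def bootstrap_t_intervals(sample, alpha=0.05, n_boot=2000, seed=0):
    """Return (equal_tailed, symmetric) bootstrap-t intervals for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    xbar, sd = _mean_sd(sample)
    roots = []
    for _ in range(n_boot):
        star = [sample[rng.randrange(n)] for _ in range(n)]
        m_star, sd_star = _mean_sd(star)
        if sd_star == 0.0:
            continue  # studentized root undefined for a constant resample
        roots.append(math.sqrt(n) * (m_star - xbar) / sd_star)
    roots.sort()
    abs_roots = sorted(abs(r) for r in roots)

    def q(values, p):
        return values[min(len(values) - 1, max(0, math.ceil(p * len(values)) - 1))]

    scale = sd / math.sqrt(n)
    # (15.62): endpoints use K_n^{-1}(1 - alpha, .) and K_n^{-1}(alpha, .)
    equal_tailed = (xbar - scale * q(roots, 1 - alpha), xbar - scale * q(roots, alpha))
    # symmetric interval: Xbar_n +/- n^{-1/2} sigma_n hat L_n^{-1}(1 - alpha, .)
    half = scale * q(abs_roots, 1 - alpha)
    return equal_tailed, (xbar - half, xbar + half)
```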

Bootstrap Calibration. By considering a studentized statistic, the bootstrap-t
yields one-sided confidence intervals with coverage error smaller than the
nonstudentized case. However, except in some simple problems, it may be difficult to
standardize or studentize a statistic because an explicit estimate of the asymptotic
variance may not be available. An alternative approach to improving coverage
error is based on the following calibration idea of Loh (1987). Let $I_n = I_n(1 - \alpha)$
be any interval with nominal level $1 - \alpha$, such as one given by the bootstrap, or
a simple normal approximation. Its coverage is defined to be

$$C_n(1 - \alpha, F) = P_F\{\theta(F) \in I_n(1 - \alpha)\} .$$

We can estimate $C_n(1 - \alpha, F)$ by its bootstrap counterpart $C_n(1 - \alpha, \hat F_n)$. Then,
determine $\hat\alpha_n$ to satisfy

$$C_n(1 - \hat\alpha_n, \hat F_n) = 1 - \alpha ,$$

so that $\hat\alpha_n$ is the value for which the estimated coverage equals the nominal
level. The calibrated interval then is defined to be $I_n(1 - \hat\alpha_n)$.

To fix ideas, suppose $I_n(1 - \alpha)$ is the one-sided normal theory interval
$(-\infty, \bar X_n + n^{-1/2}\hat\sigma_n z_{1-\alpha}]$. We argued its coverage error is $O(n^{-1/2})$. More
specifically, by (15.50) and (15.51),

$$C_n(1 - \alpha, F) = P_F\{n^{1/2}(\bar X_n - \theta(F))/\hat\sigma_n \ge z_\alpha\}$$

$$= 1 - \alpha - \frac{1}{6}\varphi(z_\alpha)\gamma(F)(2z_\alpha^2 + 1)n^{-1/2} + O(n^{-1}) .$$

Under smoothness and moment assumptions, the bootstrap estimated coverage
satisfies

$$C_n(1 - \alpha, \hat F_n) = 1 - \alpha - \frac{1}{6}\varphi(z_\alpha)\gamma(\hat F_n)(2z_\alpha^2 + 1)n^{-1/2} + O_P(n^{-1}) ,$$

and the value of $\hat\alpha_n$ is obtained by setting the estimated coverage equal to $1 - \alpha$.
One can then show that

$$\hat\alpha_n - \alpha = -\frac{1}{6}\varphi(z_\alpha)\gamma(F)(2z_\alpha^2 + 1)n^{-1/2} + O_P(n^{-1}) . \tag{15.63}$$

By using this expansion and (15.46), it can be shown that the interval $I_n(1 - \hat\alpha_n)$
has coverage $1 - \alpha + O(n^{-1})$, and hence is second-order accurate (Problem 15.39).
Thus, calibration reduces the order of coverage error.
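
The calibration step can be sketched for the one-sided normal theory bound as
follows. Under the bootstrap distribution, the bound of nominal level $1 - a$
covers $\bar X_n$ exactly when the studentized bootstrap root is at least $z_a$, so
setting the estimated coverage to $1 - \alpha$ makes $z_{\hat\alpha_n}$ the $\alpha$-quantile of the
simulated roots. The function name `calibrated_normal_upper_bound` and the
Monte Carlo details are assumptions of this sketch, not the text's
prescription.

```python
import math
import random
from statistics import NormalDist


def calibrated_normal_upper_bound(sample, alpha=0.05, n_boot=2000, seed=0):
    """Calibrate the bound Xbar_n + n^{-1/2} sigma_n hat z_{1-a} by the bootstrap.

    Returns (alpha_hat, bound), where alpha_hat is chosen so that the
    bootstrap-estimated coverage is approximately 1 - alpha.
    """
    rng = random.Random(seed)
    n = len(sample)
    xbar = sum(sample) / n
    sd = math.sqrt(sum((x - xbar) ** 2 for x in sample) / n)
    roots = []
    for _ in range(n_boot):
        star = [sample[rng.randrange(n)] for _ in range(n)]
        m = sum(star) / n
        s = math.sqrt(sum((x - m) ** 2 for x in star) / n)
        if s > 0.0:
            roots.append(math.sqrt(n) * (m - xbar) / s)
    roots.sort()
    # Estimated coverage at nominal level 1 - a is the fraction of roots >= z_a;
    # equating it to 1 - alpha gives z_{alpha_hat} = alpha-quantile of the roots.
    z_hat = roots[min(len(roots) - 1, max(0, math.ceil(alpha * len(roots)) - 1))]
    alpha_hat = NormalDist().cdf(z_hat)
    # z_{1 - alpha_hat} = -z_{alpha_hat}, so no inverse c.d.f. call is needed
    bound = xbar - sd * z_hat / math.sqrt(n)
    return alpha_hat, bound
```

In this particular case the calibrated bound coincides numerically with the
bootstrap-t upper bound $\bar X_n - n^{-1/2}\hat\sigma_n K_n^{-1}(\alpha, \hat F_n)$, which is consistent with
both being second-order accurate.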
Other Bootstrap Methods. There are now many variations on the basic
bootstrap idea that yield confidence regions that are second-order accurate, assuming
the validity of Edgeworth Expansions like the ones used in this section. The
calibration method described above is due to Loh (1987, 1991) and is essentially
equivalent to Beran's (1987, 1988) method of prepivoting (Problem 15.43). Given
an interval $I_n(1 - \alpha)$ of nominal level $1 - \alpha$, calibration produces a new interval,
say $I_n^1(1 - \alpha) = I_n(1 - \hat\alpha_n)$, where $\hat\alpha_n$ is chosen by calibration. It is tempting
to iterate this idea to further reduce coverage error. That is, now calibrate $I_n^1$ to
yield a new interval $I_n^2$, and so on. Further reduction in coverage error is indeed
possible (at the expense of increased computational effort). For further details
on these and other methods such as Efron's BC$_a$ method, see Hall and Martin
(1988), Hall (1992) and Efron and Tibshirani (1993).

The analysis of this section was limited to methods for constructing confidence
intervals for a mean, assuming the underlying distribution is smooth and
has sufficiently many moments. But many of the conclusions extend to smooth
functions of means studied in Example 15.4.3. In particular, in order to reduce
coverage error, it is desirable to use a root that is at least asymptotically pivotal,
such as a studentized root that is asymptotically standard normal. Otherwise, the
basic bootstrap interval (15.22) has the same order of coverage error as one based
on approximating the asymptotic distribution. However, whether or not the root
is asymptotically pivotal, bootstrap calibration reduces the order of coverage error.
Of course, some qualifications are necessary. For one, even in the context
of the mean, Cramér’s condition may not hold, as in the context of a binomial
proportion. Edgeworth expansions for such discrete distributions supported on
a lattice are studied in Chapter 5 of Bhattacharya and Rao (1976) and Kolassa
and McCullagh (1990); also see Brown, Cai and DasGupta (2001), who study the
binomial case. In other problems where smoothness is assumed, such as inference
for a density or quantiles, Edgeworth expansions for appropriate statistics behave
somewhat differently than they do for a mean. Such problems are treated in Hall
(1992).


15.6 Hypothesis Testing 

In this section, we consider the use of the bootstrap for the construction of
hypothesis tests. Assume the data $X^n$ is generated from some unknown law $P$.
The null hypothesis $H$ asserts that $P$ belongs to a certain family of distributions
$\mathbf{P}_0$, while the alternative hypothesis $K$ asserts that $P$ belongs to a family $\mathbf{P}_1$.
Of course, we assume the intersection of $\mathbf{P}_0$ and $\mathbf{P}_1$ is the empty set, and the
unknown law $P$ belongs to $\mathbf{P}$, the union of $\mathbf{P}_0$ and $\mathbf{P}_1$.

There are several approaches one can take to construct a hypothesis test. First,
consider the case when the null hypothesis can be expressed as a hypothesis about
a real- or vector-valued parameter $\theta(P)$. Then, one can exploit the familiar duality
between confidence regions and hypothesis tests to test hypotheses about $\theta(P)$.
Thus, a consistent in level test of the null hypothesis that $\theta(P) = \theta_0$ can be
constructed from a consistent in level confidence region for $\theta(P)$ by the rule: accept
the null hypothesis if and only if the confidence region includes $\theta_0$. Therefore,
all the methods we have thus far discussed for constructing confidence regions
may be utilized: methods based on a pivot, an asymptotic pivot, an asymptotic
approximation, or the bootstrap. Indeed, this was the bootstrap approach already
considered in Corollary 15.4.1, and it was also the basis for the multiple test
construction in Section 15.4.4.

However, not all hypothesis testing problems fit nicely into the framework of
testing parameters. For example, consider the problem of testing whether the
data come from a certain parametric submodel (such as the family of normal
distributions) of a nonparametric model, the so-called goodness of fit problem.
Or, when $X_i$ is vector-valued, consider the problem of testing whether $X_1$ has a
distribution that is spherically symmetric.

Given a test statistic $T_n$, its distribution must be known, estimated, or
approximated (at least under the null hypothesis), in order to construct a critical value.
The approach taken in this section is to estimate the null distribution of $T_n$ by
resampling from a distribution obeying the constraints of the null hypothesis.

To be explicit, assume we wish to construct a test based on a real-valued test
statistic $T_n = T_n(X^n)$ which is consistent in level and power. Large values of $T_n$
reject the null hypothesis. Thus, having picked a suitable test statistic $T_n$, our
goal is to construct a critical value, say $c_n(1-\alpha)$, so that the test which rejects
if and only if $T_n$ exceeds $c_n(1-\alpha)$ satisfies

$$P\{T_n(X^n) > c_n(1-\alpha)\} \to \alpha \quad \text{as } n \to \infty$$

when $P \in \mathbf{P}_0$. Furthermore, we require this rejection probability to tend to one
when $P \in \mathbf{P}_1$. Unlike the classical case, the critical value will be constructed to be
data-dependent (as in the case of a permutation test). To see how the bootstrap
can be used to determine a critical value, let the distribution of $T_n$ under $P$ be
denoted by

$$G_n(x, P) = P\{T_n(X^n) \le x\} .$$

Note that we have introduced $G_n(\cdot, P)$ instead of utilizing $J_n(\cdot, P)$ to
distinguish from the case of confidence intervals where $J_n(\cdot, P)$ represents the
distribution of a root which may depend both on the data and on $P$. In the
hypothesis testing context, $G_n(\cdot, P)$ represents the distribution of a statistic (and
not a root) under $P$. Let

$$g_n(1-\alpha, P) = \inf\{x : G_n(x, P) \ge 1-\alpha\} .$$

Typically, $G_n(\cdot, P)$ will converge in distribution to a limit law $G(\cdot, P)$, whose
$1-\alpha$ quantile is denoted $g(1-\alpha, P)$.

The bootstrap approach is to estimate the null sampling distribution by
$G_n(\cdot, \hat Q_n)$, where $\hat Q_n$ is an estimate of $P$ in $\mathbf{P}_0$ so that $\hat Q_n$ satisfies the
constraints of the null hypothesis, since critical values should be determined as if
the null hypothesis were true. A bootstrap critical value can then be defined by
$g_n(1-\alpha, \hat Q_n)$. The resulting nominal level $\alpha$ bootstrap test rejects $H$ if and only
if $T_n > g_n(1-\alpha, \hat Q_n)$.

Notice that we would not want to replace a $\hat Q_n$ satisfying the null hypothesis
constraints by the empirical distribution function $\hat P_n$, the usual resampling
mechanism of resampling the data with replacement. One might say that the
bootstrap is so adept at estimating the distribution of a statistic that $G_n(\cdot, \hat P_n)$
is a good estimate of $G_n(\cdot, P)$ whether or not $P$ satisfies the null hypothesis
constraints. Hence, the test that rejects when $T_n$ exceeds $g_n(1-\alpha, \hat P_n)$ will (under
suitable conditions) behave asymptotically like the test that rejects when $T_n$
exceeds $g_n(1-\alpha, P)$, and this test has asymptotic probability $\alpha$ of rejecting
the null hypothesis, even if $P \in \mathbf{P}_1$. But, when $P \in \mathbf{P}_1$, we would want the test
to reject with probability approaching one.

Thus, the choice of resampling distribution $\hat Q_n$ should satisfy the following. If
$P \in \mathbf{P}_0$, $\hat Q_n$ should be near $P$ so that $G_n(\cdot, P) \approx G_n(\cdot, \hat Q_n)$; then, $g_n(1-\alpha, P) \approx
g_n(1-\alpha, \hat Q_n)$ and the asymptotic rejection probability approaches $\alpha$. If, on the
other hand, $P \in \mathbf{P}_1$, $\hat Q_n$ should not approach $P$, but some $P_0$ in $\mathbf{P}_0$. In this way,
the critical value should satisfy

$$g_n(1-\alpha, \hat Q_n) \approx g_n(1-\alpha, P_0) \to g(1-\alpha, P_0) < \infty$$

as $n \to \infty$. Then, assuming the test statistic is constructed so that $T_n \to \infty$
under $P$ when $P \in \mathbf{P}_1$, we will have

$$P\{T_n > g_n(1-\alpha, \hat Q_n)\} \approx P\{T_n > g(1-\alpha, P_0)\} \to 1$$

as $n \to \infty$, by Slutsky’s Theorem.

As in the construction of confidence intervals, $G_n(\cdot, P)$ must be smooth in $P$ in
order for the bootstrap to succeed. In the theorem below, rather than specifying
a set of sequences $\mathbf{C}_P$ as was done in Theorem 15.4.1, smoothness is described in
terms of a metric $d$, but either approach could be used. The proof is analogous
to the proof of Theorem 15.4.1.

Theorem 15.6.1 Let $X^n$ be generated from a probability law $P \in \mathbf{P}_0$. Assume
the following triangular array convergence: $d(P_n, P) \to 0$ and $P \in \mathbf{P}_0$ implies
$G_n(\cdot, P_n)$ converges weakly to $G(\cdot, P)$ with $G(\cdot, P)$ continuous. Moreover, assume
$\hat Q_n$ is an estimator of $P$ based on $X^n$ which satisfies $d(\hat Q_n, P) \to 0$ in probability
whenever $P \in \mathbf{P}_0$. Then,

$$P\{T_n > g_n(1-\alpha, \hat Q_n)\} \to \alpha \quad \text{as } n \to \infty .$$

Example 15.6.1 (Normal Correlation) Suppose $(Y_i, Z_i)$, $i = 1, \ldots, n$ are
i.i.d. bivariate normal with unknown means, variances, and correlation $\rho$. The
null hypothesis specifies $\rho = \rho_0$ versus $\rho > \rho_0$. Let $T_n = n^{1/2}\hat\rho_n$, where $\hat\rho_n$ is
the usual sample correlation. Under the null hypothesis, the distribution of $T_n$
does not depend on any unknown parameters. So, if $\hat Q_n$ is any bivariate normal
distribution with $\rho = \rho_0$, the bootstrap sampling distribution $G_n(\cdot, \hat Q_n)$ is
exactly equal to the true null sampling distribution. Note, however, that inverting
a parametric bootstrap confidence bound using the root $n^{1/2}(\hat\rho_n - \rho)$ would not
be exact. ■
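The invariance claimed in Example 15.6.1 is easy to check by simulation. The sketch below approximates the $1-\alpha$ quantile of $G_n(\cdot, \hat Q_n)$ by Monte Carlo for a bivariate normal $\hat Q_n$ with correlation $\rho_0$; the function name, the default means and standard deviations, and the number of simulations are illustrative assumptions.

```python
import numpy as np


def null_critical_value(n, rho0, alpha, mean=(0.0, 0.0), sd=(1.0, 1.0),
                        n_sim=5000, seed=0):
    """Monte Carlo 1 - alpha quantile of T_n = sqrt(n) * sample correlation
    under a bivariate normal law with correlation rho0."""
    rng = np.random.default_rng(seed)
    # Build correlated pairs from independent standard normals.
    u = rng.standard_normal((n_sim, n))
    v = rng.standard_normal((n_sim, n))
    y = mean[0] + sd[0] * u
    z = mean[1] + sd[1] * (rho0 * u + np.sqrt(1.0 - rho0 ** 2) * v)
    # Sample correlation of each simulated data set.
    yc = y - y.mean(axis=1, keepdims=True)
    zc = z - z.mean(axis=1, keepdims=True)
    r = (yc * zc).sum(axis=1) / np.sqrt((yc ** 2).sum(axis=1) * (zc ** 2).sum(axis=1))
    return float(np.quantile(np.sqrt(n) * r, 1.0 - alpha))
```

With a common seed, changing `mean` and `sd` leaves the returned critical value unchanged up to rounding error, because the sample correlation is invariant under location and scale changes of each coordinate. This is the sense in which any bivariate normal $\hat Q_n$ with $\rho = \rho_0$ gives the same null distribution.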

Example 15.6.2 (Likelihood Ratio Tests) Suppose $X_1, \ldots, X_n$ are i.i.d.
according to a model $\{P_\theta, \theta \in \Omega\}$, where $\Omega$ is an open subset of $\mathbb{R}^k$. Assume
$\theta$ is partitioned as $(\xi, \mu)$, where $\xi$ is a vector of length $p$ and $\mu$ is a vector of
length $k - p$. The null hypothesis parameter space $\Omega_0$ specifies $\xi = \xi_0$. Under
the conditions of Theorem 12.4.2, the likelihood ratio statistic $T_n = 2\log(R_n)$
is asymptotically $\chi_p^2$ under the null hypothesis. Suppose $(\xi_0, \hat\mu_{n,0})$ is an efficient
likelihood estimator of $\theta$ for the model $\Omega_0$. Rather than using the critical value
obtained from $\chi_p^2$, one could bootstrap $T_n$. So, let $G_n(x, \theta)$ denote the distribution
of $T_n$ under $\theta$. An appropriate parametric bootstrap test obeying the null
hypothesis constraints is to reject the null when $T_n$ exceeds the $1-\alpha$ quantile of
$G_n(x, (\xi_0, \hat\mu_{n,0}))$. Beran and Ducharme (1991) argue that, under regularity
conditions, the bootstrap test has error in rejection probability equal to $O(n^{-2})$, while
the usual likelihood ratio test has error $O(n^{-1})$. Moreover, the bootstrap test can
be viewed as an analytical approximation to a Bartlett-corrected likelihood ratio
test (see Section 12.4.4). In essence, the bootstrap automatically captures the
Bartlett correction and avoids the need for analytical calculation. As an example,
recall Example 12.4.7, where it was observed the Bartlett-corrected likelihood
ratio test has error $O(n^{-2})$. Here, the bootstrap test is exact (Problem 15.45). ■

Example 15.6.3 (Behrens-Fisher Problem Revisited) For $j = 1, 2$, let
$X_{i,j}$, $i = 1, \ldots, n_j$ be independent with $X_{i,j}$ distributed as $N(\mu_j, \sigma_j^2)$. All four
parameters are unknown and vary independently. The null hypothesis asserts
$\mu_1 = \mu_2$ and the alternative is $\mu_1 > \mu_2$. Let $n = n_1 + n_2$, and for simplicity
assume $n_1$ to be the integer part of $\lambda n$ for some $0 < \lambda < 1$. Let $(\bar X_{n,j}, S_{n,j}^2)$ be
the usual unbiased estimators of $(\mu_j, \sigma_j^2)$ based on the $j$th sample. Consider the
test statistic

$$T_n = (\bar X_{n,1} - \bar X_{n,2}) \Big/ \sqrt{\frac{S_{n,1}^2}{n_1} + \frac{S_{n,2}^2}{n_2}} .$$

By Example 13.5.4, the test that rejects the null hypothesis when $T_n > z_{1-\alpha}$ is
efficient. However, we now study its actual rejection probability.

The null distribution of $T_n$ depends on $\sigma^2 = (\sigma_1^2, \sigma_2^2)$ only through the ratio
$\sigma_1/\sigma_2$, and we denote this distribution by $G_n(\cdot, \sigma^2)$. Let $S_n^2 = (S_{n,1}^2, S_{n,2}^2)$. Like
the method used in Problem 11.89, by conditioning on $S_n^2$, we can write

$$G_n(x, \sigma^2) = E[a(S_n^2, \sigma^2, x)] ,$$

where

$$a(S_n^2, \sigma^2, x) = \Phi[(1+\delta)^{1/2}x]$$

and

$$\delta = \sum_{j=1}^{2} n_j^{-1}(S_{n,j}^2 - \sigma_j^2) \Big/ \sum_{j=1}^{2} n_j^{-1}\sigma_j^2 .$$

By Taylor expansion and the moments of $S_n^2$, it follows that (Problem 15.46)

$$G_n(x, \sigma^2) = \Phi(x) + \frac{1}{n} b_n(x, \sigma^2) + O(n^{-2}) , \tag{15.64}$$

where

$$\frac{1}{n} b_n(x, \sigma^2) = -(x + x^3)\varphi(x)\rho_n^2/4$$

is $O(n^{-1})$ and

$$\rho_n^2 = \sum_{j=1}^{2} (n_j - 1)^{-1} n_j^{-2}\sigma_j^4 \Big/ \Big(\sum_{j=1}^{2} n_j^{-1}\sigma_j^2\Big)^2 .$$

Correspondingly, the quantile function satisfies

$$G_n^{-1}(1-\alpha, \sigma^2) = z_{1-\alpha} + (z_{1-\alpha} + z_{1-\alpha}^3)\rho_n^2/4 + O(n^{-2}) . \tag{15.65}$$

It follows that the rejection probability of the asymptotic test that rejects when
$T_n > z_{1-\alpha}$ is $\alpha + O(n^{-1})$.

Consider next the (parametric) bootstrap-t, which rejects when $T_n > G_n^{-1}(1-\alpha, S_n^2)$.
Its rejection probability can be expressed as

$$1 - E[a(S_n^2, \sigma^2, G_n^{-1}(1-\alpha, S_n^2))] .$$

By Taylor expansion, it can be shown that the rejection probability of the test is
$\alpha + O(n^{-2})$ (Problem 15.47). Thus, the bootstrap-t improves upon the asymptotic
expansion. In fact, bootstrap calibration (or the use of prepivoting) further
reduces the error in rejection probability to $O(n^{-3})$. Details are in Beran (1988),
who further argues that the Welch method described in Section 11.3.1 behaves
like the bootstrap-t method. Although the Welch approximation is based on elegant
mathematics, the bootstrap approach essentially reproduces the analytical
approximation automatically. ■
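As a rough numerical companion to Example 15.6.3, the sketch below approximates the parametric bootstrap-t critical value $G_n^{-1}(1-\alpha, S_n^2)$ by simulating two normal samples with common mean and with variances set to the observed sample variances. The function names and the number of simulations are hypothetical, and Monte Carlo replaces the exact distribution $G_n$.

```python
import numpy as np


def welch_statistic(x1, x2):
    """T_n: difference in sample means divided by its estimated standard error."""
    n1, n2 = len(x1), len(x2)
    return (x1.mean() - x2.mean()) / np.sqrt(x1.var(ddof=1) / n1 + x2.var(ddof=1) / n2)


def bootstrap_t_critical_value(s1_sq, s2_sq, n1, n2, alpha=0.05,
                               n_boot=20000, seed=0):
    """Monte Carlo 1 - alpha quantile of T_n when both samples are normal with
    equal means and variances (s1_sq, s2_sq)."""
    rng = np.random.default_rng(seed)
    y1 = rng.normal(0.0, np.sqrt(s1_sq), size=(n_boot, n1))
    y2 = rng.normal(0.0, np.sqrt(s2_sq), size=(n_boot, n2))
    t_star = (y1.mean(axis=1) - y2.mean(axis=1)) / np.sqrt(
        y1.var(axis=1, ddof=1) / n1 + y2.var(axis=1, ddof=1) / n2)
    return float(np.quantile(t_star, 1.0 - alpha))
```

The test rejects when `welch_statistic(x1, x2)` exceeds `bootstrap_t_critical_value(x1.var(ddof=1), x2.var(ddof=1), len(x1), len(x2))`. Multiplying both variances by a common factor does not change the critical value, in line with the remark that the null distribution depends on $\sigma^2$ only through $\sigma_1/\sigma_2$.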


Example 15.6.4 (Nonparametric Mean) Let $X_1, \ldots, X_n$ be i.i.d. observations
on the real line with probability law $P$, mean $\mu(P)$ and finite variance
$\sigma^2(P)$. The problem is to test $\mu(P) = 0$ against either a one-sided or two-sided
alternative. So, $\mathbf{P}_0$ is the set of distributions with mean zero and finite variance.
In the one-sided case, consider the test statistic $T_n = n^{1/2}\bar X_n$, where $\bar X_n$ is the
sample mean, since test statistics based on $\bar X_n$ were seen in Section 11.4 to possess
a certain optimality property. We will also consider the studentized statistic
$T_n' = n^{1/2}\bar X_n/S_n$, where we shall take $S_n^2$ to be the unbiased estimate of variance.
To apply Theorem 15.6.1, let $\hat Q_n$ be the empirical distribution $\hat P_n$ shifted by $\bar X_n$
so it has mean 0. Then, the error in rejection probability will be $O(n^{-1/2})$ for $T_n$,
and will be $O(n^{-1})$ for $T_n'$, at least under the assumptions that $P$ is smooth and
has infinitely many moments; these statements follow from the results in Section
15.5 (Problem 15.49).
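The shifted empirical distribution is simple to implement. The following sketch resamples from the centered data, so that the resampling distribution has mean zero, and returns the observed statistic, a Monte Carlo approximation to the bootstrap critical value, and a bootstrap p-value for the one-sided test. The function name and defaults are illustrative assumptions.

```python
import numpy as np


def shifted_bootstrap_test(x, alpha=0.05, studentize=True, n_boot=5000, seed=0):
    """One-sided bootstrap test of mean zero, resampling from the centered data.

    Returns (observed statistic, bootstrap critical value, bootstrap p-value).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)

    def stat(sample):
        # sample has shape (..., n); the statistic is computed along the last axis
        m = sample.mean(axis=-1)
        if studentize:
            return np.sqrt(n) * m / sample.std(axis=-1, ddof=1)
        return np.sqrt(n) * m

    t_obs = float(stat(x))
    centered = x - x.mean()                     # support of the shifted empirical law
    idx = rng.integers(0, n, size=(n_boot, n))  # resample with replacement
    t_star = stat(centered[idx])
    crit = float(np.quantile(t_star, 1.0 - alpha))
    p_value = float(np.mean(t_star >= t_obs))
    return t_obs, crit, p_value
```

Resampling from the uncentered data instead would estimate the distribution of the statistic under the true $P$ rather than under the null hypothesis, which is exactly the pitfall discussed before Theorem 15.6.1.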

While shifting the empirical distribution works in this example, it is not easy
to generalize when testing other parameters. Therefore, we consider the following
alternative approach. The idea is to choose the distribution in $\mathbf{P}_0$ that is in some
sense closest to the empirical distribution $\hat P_n$. One way to describe closeness is
the following. For distributions $P$ and $Q$ on the real line, let $\delta_{KL}(P, Q)$ be the
(forward) Kullback-Leibler divergence between $P$ and $Q$ (studied in Example
11.2.4), defined by

$$\delta_{KL}(P, Q) = \int \log\Big(\frac{dP}{dQ}\Big)\, dP . \tag{15.66}$$

Note that $\delta_{KL}(P, Q)$ may be $\infty$, $\delta_{KL}$ is not a metric, and it is not even
symmetric in its arguments. Let $\hat Q_n$ be the $Q$ that minimizes $\delta_{KL}(\hat P_n, Q)$ over $Q$
in $\mathbf{P}_0$. This choice for $\hat Q_n$ can be shown to be well-defined and corresponds to
finding the nonparametric maximum likelihood estimator of $P$ assuming $P$ is
constrained to have mean zero. (Another possibility is to minimize the (backward)
Kullback-Leibler divergence $\delta_{KL}(Q, \hat P_n)$.) By Efron (1981) (Problem 15.50), $\hat Q_n$
assigns mass $w_i$ to $X_i$, where $w_i$ satisfies

$$w_i \propto (1 + tX_i)^{-1}$$

and $t$ is chosen so that $\sum_{i=1}^{n} w_i X_i = 0$. Now, one could bootstrap either $T_n$ or
$T_n'$ from $\hat Q_n$.
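The weights $w_i$ can be computed numerically. The sketch below assumes the data contain both negative and positive values (otherwise no distribution supported on the data has mean zero) and locates $t$ by bisection, using the fact that $\sum_i X_i/(1+tX_i)$ is strictly decreasing in $t$ on the interval where all weights are positive. The function name, tolerance, and the use of bisection are illustrative choices.

```python
import numpy as np


def el_weights(x, tol=1e-12, max_iter=200):
    """Weights w_i proportional to 1 / (1 + t * x_i) with sum_i w_i * x_i = 0.

    Returns (weights, t).
    """
    x = np.asarray(x, dtype=float)
    if not (x.min() < 0.0 < x.max()):
        raise ValueError("zero must lie strictly inside the range of the data")

    def g(t):
        # Weighted-mean constraint up to a positive factor; decreasing in t.
        return np.sum(x / (1.0 + t * x))

    # All weights are positive exactly when -1/max(x) < t < -1/min(x).
    lo, hi = -1.0 / x.max(), -1.0 / x.min()
    eps = 1e-10 * (hi - lo)
    a, b = lo + eps, hi - eps
    for _ in range(max_iter):
        mid = 0.5 * (a + b)
        if g(mid) > 0.0:
            a = mid
        else:
            b = mid
        if b - a < tol:
            break
    t = 0.5 * (a + b)
    w = 1.0 / (1.0 + t * x)
    return w / w.sum(), t
```

A bootstrap sample from $\hat Q_n$ can then be drawn with, for example, `rng.choice(x, size=len(x), replace=True, p=w)` for a NumPy generator `rng`.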

In fact, this approach suggests an alternative test statistic given by $T_n'' =
n\delta_{KL}(\hat P_n, \hat Q_n)$, where $\hat Q_n$ is the $Q$ minimizing the Kullback-Leibler divergence
$\delta_{KL}(\hat P_n, Q)$ over $Q$ in $\mathbf{P}_0$. This is equivalent to the test statistic used by
Owen (1988, 2001) in his construction of empirical likelihood, who shows the
limiting distribution of $2T_n''$ under the null hypothesis is Chi-squared with 1
degree of freedom. The wide scope of empirical likelihood is presented in Owen
(2001). ■


Example 15.6.5 (Goodness of fit) The problem is to test whether the underlying
probability distribution $P$ belongs to a parametric family of distributions
$\mathbf{P}_0 = \{P_\theta, \theta \in \Theta_0\}$, where $\Theta_0$ is an open subset of $k$-dimensional Euclidean
space. Let $\hat P_n$ be the empirical measure based on $X_1, \ldots, X_n$. Let $\hat\theta_n \in \Theta_0$ be an
estimator of $\theta$. Consider the test statistic

$$T_n = n^{1/2}\delta(\hat P_n, P_{\hat\theta_n}) ,$$

where $\delta$ is some measure (typically a metric) between $\hat P_n$ and $P_{\hat\theta_n}$. (In fact, $\delta$
need not even be symmetric, which is useful sometimes: for example, consider
the Cramér-von Mises statistic.) Beran (1986) considers the case where $\hat\theta_n$ is
a minimum distance estimator, while Romano (1988) assumes that $\hat\theta_n$ is some
asymptotically linear estimator (like an efficient likelihood estimator). For the
resampling mechanism, take $\hat Q_n = P_{\hat\theta_n}$. Beran (1986) and Romano (1988)
give different sets of conditions so that the above theorem is applicable, both
requiring the machinery of empirical processes. ■


15.7 Subsampling 

In this section, a general theory for the construction of approximate confidence
sets or hypothesis tests is presented, so the goal is the same as that of the
bootstrap. The basic idea is to approximate the sampling distribution of a statistic
based on the values of the statistic computed over smaller subsets of the data. For
example, in the case where the data are $n$ observations which are independent
and identically distributed, a statistic $\hat\theta_n$ is computed based on the entire data
set and is recomputed over all $\binom{n}{b}$ data sets of size $b$. Implicit is the notion of
a statistic sequence, so that the statistic is defined for samples of size $n$ and $b$.
These recomputed values of the statistic are suitably normalized to approximate
the true sampling distribution.

This approach based on subsamples is perhaps the most general one for
approximating a sampling distribution, in the sense that consistency holds under
extremely weak conditions. That is, it will be seen that, under very weak
assumptions on $b$, the method is consistent whenever the original statistic, suitably
normalized, has a limit distribution under the true model. The bootstrap, on
the other hand, requires that the distribution of the statistic is somehow locally
smooth as a function of the unknown model. In contrast, no such assumption
is required in the theory for subsampling. Indeed, the method here is applicable
even in the several known situations which represent counterexamples to the
bootstrap. However, when both subsampling and the bootstrap are consistent,
the bootstrap is typically more accurate.

To appreciate why subsampling behaves well under such weak assumptions,
note that each subset of size $b$ (taken without replacement from the original
data) is indeed a sample of size $b$ from the true model. If $b$ is small compared
to $n$ (meaning $b/n \to 0$), then there are many (namely $\binom{n}{b}$) subsamples of size $b$
available. Hence, it should be intuitively clear that one can at least approximate
the sampling distribution of the (normalized) statistic $\hat\theta_b$ by recomputing the
values of the statistic over all these subsamples. But, under the weak convergence
hypothesis, the sampling distributions based on samples of size $b$ and $n$ should
be close. The bootstrap, on the other hand, is based on recomputing a statistic
over a sample of size $n$ from some estimated model which is hopefully close to
the true model.

The use of subsample values to approximate the variance of a statistic is
well-known. The Quenouille-Tukey jackknife estimates of bias and variance, based on
computing a statistic over all subsamples of size $n - 1$, have been well-studied and
are closely related to the mean and variance of our estimated sampling distribution
with $b = n - 1$. For further history of subsampling methods, see Politis, Romano,
and Wolf (1999).

15.7.1 The Basic Theorem in the I.I.D. Case 

Suppose $X_1, \ldots, X_n$ is a sample of $n$ i.i.d. random variables taking values in an
arbitrary sample space $S$. The common probability measure generating the
observations is denoted $P$. The goal is to construct a confidence region for some
parameter $\theta(P)$. For now, assume $\theta$ is real-valued, but this can and will be
generalized to allow for the construction of confidence regions for multivariate parameters
or confidence bands for functions.

Let $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ be an estimator of $\theta(P)$. It is desired to estimate the
true sampling distribution of $\hat\theta_n$ in order to make inferences about $\theta(P)$. Nothing
is assumed about the form of the estimator.

As in previous sections, let $J_n(P)$ be the sampling distribution of the root
$\tau_n(\hat\theta_n - \theta(P))$ based on a sample of size $n$ from $P$, where $\tau_n$ is a normalizing
constant. Here, $\tau_n$ is assumed known and does not depend on $P$. Also define the
corresponding cumulative distribution function:

$$J_n(x, P) = P\{\tau_n[\hat\theta_n(X_1, \ldots, X_n) - \theta(P)] \le x\} .$$

Essentially, the only assumption that we will need to construct asymptotically
valid confidence intervals for $\theta(P)$ is the following.

Assumption 15.7.1 There exists a limiting distribution $J(P)$ such that $J_n(P)$
converges weakly to $J(P)$ as $n \to \infty$.

This assumption will be required to hold for some sequence $\tau_n$. The most
informative case occurs when $\tau_n$ is such that the limit law $J(P)$ is nondegenerate.

To describe the subsampling method, consider the $N_n = \binom{n}{b}$ subsets of size $b$ of
the data $\{X_1, \ldots, X_n\}$; call them $Y_1, \ldots, Y_{N_n}$, ordered in any fashion. Thus, each
$Y_i$ constitutes a sample of size $b$ from $P$. Of course, the $Y_i$ depend on $b$ and $n$,
but this notation has been suppressed. Only a very weak assumption on $b$ will be
required. In the consistency results that follow, it will be assumed that $b/n \to 0$
and $b \to \infty$ as $n \to \infty$. Now, let $\hat\theta_{n,b,i}$ be equal to the statistic $\hat\theta_b$ evaluated at
the data set $Y_i$. The approximation to $J_n(x, P)$ we study is defined by

$$L_{n,b}(x) = N_n^{-1}\sum_{i=1}^{N_n} I\{\tau_b(\hat\theta_{n,b,i} - \hat\theta_n) \le x\} . \tag{15.67}$$

The motivation behind the method is the following. For any $i$, $Y_i$ is actually
a random sample of $b$ i.i.d. observations from $P$. Hence, the exact distribution of
$\tau_b(\hat\theta_{n,b,i} - \theta(P))$ is $J_b(P)$. The empirical distribution of the $N_n$ values of
$\tau_b(\hat\theta_{n,b,i} - \theta(P))$ should then serve as a good approximation to $J_n(P)$. Of course, $\theta(P)$ is
unknown, so we replace $\theta(P)$ by $\hat\theta_n$, which is asymptotically permissible because
$\tau_b(\hat\theta_n - \theta(P))$ is of order $\tau_b/\tau_n$, and $\tau_b/\tau_n$ will be assumed to tend to zero.


Theorem 15.7.1 Suppose Assumption 15.7.1 holds. Also, assume $\tau_b/\tau_n \to 0$,
$b \to \infty$, and $b/n \to 0$ as $n \to \infty$.

(i) If $x$ is a continuity point of $J(\cdot, P)$, then $L_{n,b}(x) \to J(x, P)$ in probability.

(ii) If $J(\cdot, P)$ is continuous, then

$$\sup_x |L_{n,b}(x) - J_n(x, P)| \to 0 \quad \text{in probability} . \tag{15.68}$$

(iii) Let

$$c_{n,b}(1-\alpha) = \inf\{x : L_{n,b}(x) \ge 1-\alpha\}$$

and

$$c(1-\alpha, P) = \inf\{x : J(x, P) \ge 1-\alpha\} .$$

If $J(\cdot, P)$ is continuous at $c(1-\alpha, P)$, then

$$P\{\tau_n[\hat\theta_n - \theta(P)] \le c_{n,b}(1-\alpha)\} \to 1-\alpha \quad \text{as } n \to \infty . \tag{15.69}$$

Therefore, the asymptotic coverage probability under $P$ of the confidence
interval $[\hat\theta_n - \tau_n^{-1}c_{n,b}(1-\alpha), \infty)$ is the nominal level $1-\alpha$.
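A minimal computational sketch of the subsampling interval in part (iii) follows. Because $N_n = \binom{n}{b}$ is usually far too large to enumerate, the code averages over randomly chosen subsets instead, which is a stochastic approximation to $L_{n,b}$ and an assumption of this sketch rather than part of the theorem. The function name, the default estimator (the sample mean), and the default rate $\tau_n = n^{1/2}$ are illustrative.

```python
import numpy as np


def subsampling_lower_bound(x, b, alpha=0.05, n_sub=2000, seed=0,
                            estimator=np.mean, rate=np.sqrt):
    """Left endpoint of the one-sided subsampling confidence interval."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_n = estimator(x)

    roots = np.empty(n_sub)
    for i in range(n_sub):
        sub = rng.choice(x, size=b, replace=False)   # a subset of size b
        roots[i] = rate(b) * (estimator(sub) - theta_n)

    # c = inf{x : L(x) >= 1 - alpha}, with L the empirical c.d.f. of the roots,
    # is the k-th smallest root where k = ceil((1 - alpha) * n_sub).
    k = max(int(np.ceil((1.0 - alpha) * n_sub)), 1)
    c = np.sort(roots)[k - 1]
    return theta_n - c / rate(n)
```

For another estimator or convergence rate, pass `estimator` and `rate` explicitly; for instance, `estimator=np.max` with `rate=lambda m: m` corresponds to the normalization $\tau_n = n$.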


Proof. Let

$$U_n(x) = U_{n,b}(x, P) = N_n^{-1}\sum_{i=1}^{N_n} I\{\tau_b[\hat\theta_{n,b,i} - \theta(P)] \le x\} . \tag{15.70}$$

Note that the dependence of $U_n(x)$ on $b$ and $P$ will now be suppressed for notational
convenience. To prove (i), it suffices to show $U_n(x)$ converges in probability
to $J(x, P)$ for every continuity point $x$ of $J(\cdot, P)$. To see why, note that

$$L_{n,b}(x) = N_n^{-1}\sum_{i} I\{\tau_b[\hat\theta_{n,b,i} - \theta(P)] + \tau_b[\theta(P) - \hat\theta_n] \le x\} ,$$

so that for every $\epsilon > 0$,

$$U_n(x - \epsilon)I\{E_n\} \le L_{n,b}(x)I\{E_n\} \le U_n(x + \epsilon)I\{E_n\} ,$$

where $I\{E_n\}$ is the indicator of the event $E_n = \{\tau_b|\theta(P) - \hat\theta_n| \le \epsilon\}$. But, the
event $E_n$ has probability tending to one. So, with probability tending to one,

$$U_n(x - \epsilon) \le L_{n,b}(x) \le U_n(x + \epsilon)$$

for any $\epsilon > 0$. Hence, if $x + \epsilon$ and $x - \epsilon$ are continuity points of $J(\cdot, P)$, then
$U_n(x \pm \epsilon) \to J(x \pm \epsilon, P)$ in probability implies

$$J(x - \epsilon, P) - \epsilon \le L_{n,b}(x) \le J(x + \epsilon, P) + \epsilon$$

with probability tending to one. Now, let $\epsilon \to 0$ so that $x \pm \epsilon$ are continuity
points of $J(\cdot, P)$. Then, it suffices to show $U_n(x) \to J(x, P)$ in probability
for all continuity points $x$ of $J(\cdot, P)$. But, $0 \le U_n(x) \le 1$ and $E[U_n(x)] =
J_b(x, P)$. Since $J_b(x, P) \to J(x, P)$, it suffices to show $Var[U_n(x)] \to 0$. To
this end, suppose $k$ is the greatest integer less than or equal to $n/b$. For
$j = 1, \ldots, k$, let $R_{n,b,j}$ be equal to the statistic $\hat\theta_b$ evaluated at the data set
$(X_{b(j-1)+1}, X_{b(j-1)+2}, \ldots, X_{b(j-1)+b})$ and set

$$\bar U_n(x) = k^{-1}\sum_{j=1}^{k} I\{\tau_b[R_{n,b,j} - \theta(P)] \le x\} .$$

Clearly, $\bar U_n(x)$ and $U_n(x)$ have the same expectation. But, since $\bar U_n(x)$ is the
average of $k$ i.i.d. variables (each of which is bounded between 0 and 1), it follows
that

$$Var[\bar U_n(x)] \le \frac{1}{4k} \to 0$$

as $n \to \infty$. Intuitively, $U_n(x)$ should have a smaller variance than $\bar U_n(x)$, because
$\bar U_n(x)$ uses the ordering in the sample in an arbitrary way. Formally, we can write

$$U_n(x) = E[\bar U_n(x) \mid \mathbf{X}_n] ,$$

where $\mathbf{X}_n$ is the information containing the original sample but without regard
to their order. Applying the inequality $[E(Y)]^2 \le E(Y^2)$ (conditionally) yields

$$E[U_n^2(x)] = E\{E[\bar U_n(x) \mid \mathbf{X}_n]\}^2 \le E\{E[\bar U_n^2(x) \mid \mathbf{X}_n]\} = E[\bar U_n^2(x)] .$$

Thus, $Var[U_n(x)] \le Var[\bar U_n(x)] \to 0$ and (i) follows.

To prove (ii), given any subsequence $\{n_k\}$, one can extract a further subsequence
$\{n_{k_j}\}$ so that $L_{n_{k_j}}(x) \to J(x, P)$ almost surely. Therefore, $L_{n_{k_j}}(x) \to
J(x, P)$ almost surely for all $x$ in some countable dense set of the real line. So,
$L_{n_{k_j}}$ tends weakly to $J(x, P)$ and this convergence is uniform by Polya’s Theorem.
Hence, the result (ii) holds.

To prove (iii), $c_{n,b}(1-\alpha) \to c(1-\alpha, P)$ in probability by Lemma 11.2.1 (ii). The limiting
coverage probability now follows from Slutsky’s Theorem. ■

The assumptions $b/n \to 0$ and $b \to \infty$ need not imply $\tau_b/\tau_n \to 0$. For example,
in the unusual case $\tau_n = \log(n)$, if $b = n^\gamma$ and $\gamma > 0$, then $\tau_b/\tau_n = \gamma$ and the
assumption $\tau_b/\tau_n \to 0$ is not satisfied. In fact, a slight modification of the method is
consistent without assuming $\tau_b/\tau_n \to 0$; see Politis, Romano, and Wolf (1999), Corollary 2.2.1. In
regular cases, $\tau_n = n^{1/2}$, and the assumptions on $b$ simplify to $b/n \to 0$ and
$b \to \infty$.

The assumptions on $b$ are as weak as possible under the weak assumptions of
the theorem. However, in some cases, the choice $b = O(n)$ yields similar results;
this occurs in Wu (1990), where the statistic is approximately linear with an
asymptotic normal distribution and $\tau_n = n^{1/2}$. This choice will not work in
general; see Example 15.7.2.

Assumption 15.7.1 is satisfied in numerous examples, including all previous 
examples considered by the bootstrap. 


15.7.2 Comparison with the Bootstrap 

The usual bootstrap approximation to $J_n(x, P)$ is $J_n(x, \hat Q_n)$, where $\hat Q_n$ is some
estimate of $P$. In many nonparametric i.i.d. situations, $\hat Q_n$ is taken to be the
empirical distribution of the sample $X_1, \ldots, X_n$. In Section 15.4, we proved results
analogous to (15.68) and (15.69) with $L_{n,b}(x)$ replaced by $J_n(x, \hat Q_n)$. While the consistency of
the bootstrap requires arguments specific to the problem at hand, the consistency
of subsampling holds quite generally.

To elaborate a little further, we proved bootstrap limit results in the following
manner. For some choice of metric (or pseudo-metric) $d$ on the space of probability
measures, it must be known that $d(P_n, P) \to 0$ implies $J_n(P_n)$ converges weakly
to $J(P)$. That is, Assumption 15.7.1 must be strengthened so that the convergence
of $J_n(P)$ to $J(P)$ is suitably locally uniform in $P$. In addition, the estimator $\hat Q_n$
must then be known to satisfy $d(\hat Q_n, P) \to 0$ almost surely or in probability
under $P$. In contrast, no such strengthening of Assumption 15.7.1 is required in
Theorem 15.7.1. In the known counterexamples to the bootstrap, it is precisely a
certain lack of uniformity in convergence which leads to failure of the bootstrap.

In some special cases, it has been realized that a sample size trick can often
remedy the inconsistency of the bootstrap. To describe how, focus on the case
where $\hat Q_n$ is the empirical measure, denoted by $\hat P_n$. Rather than approximating
$J_n(P)$ by $J_n(\hat P_n)$, the suggestion is to approximate $J_n(P)$ by $J_b(\hat P_n)$ for some $b$
which usually satisfies $b/n \to 0$ and $b \to \infty$. The resulting estimator $J_b(x, \hat P_n)$ is
obviously quite similar to our $L_{n,b}(x)$ given in (15.67). In words, $J_b(x, \hat P_n)$ is the
bootstrap approximation defined by the distribution (conditional on the data)
of $\tau_b[\hat\theta_b(X_1^*, \ldots, X_b^*) - \hat\theta_n]$, where $X_1^*, \ldots, X_b^*$ are chosen with replacement from
$X_1, \ldots, X_n$. In contrast, $L_{n,b}(x)$ is the distribution (conditional on the data) of
$\tau_b[\hat\theta_b(Y_1^*, \ldots, Y_b^*) - \hat\theta_n]$, where $Y_1^*, \ldots, Y_b^*$ are chosen without replacement from
$X_1, \ldots, X_n$. Clearly, these two approaches must be similar if $b$ is so small that
sampling with and without replacement are essentially the same. Indeed, if one
resamples $b$ numbers (or indices) from the set $\{1, \ldots, n\}$, then the chance that
none of the indices is duplicated is $\prod_{i=1}^{b-1}(1 - \frac{i}{n})$. This probability tends to 1 if
$b^2/n \to 0$. (To see why, take logs and do a Taylor expansion analysis.) Hence, the
following is true.

Corollary 15.7.1 Under the further assumption that $b^2/n \to 0$, parts (i)-(iii) of
Theorem 15.7.1 remain valid if $L_{n,b}(x)$ is replaced by the bootstrap approximation
$J_b(x, \hat P_n)$.
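The role of the condition $b^2/n \to 0$ can be seen numerically. The short sketch below evaluates the probability that $b$ indices drawn with replacement from $\{1, \ldots, n\}$ are all distinct, working on the log scale as suggested by the Taylor expansion argument; the function name is hypothetical.

```python
import math


def prob_no_duplicates(n, b):
    """prod_{i=1}^{b-1} (1 - i/n): chance that b draws with replacement
    from n labels contain no repeated label."""
    return math.exp(sum(math.log1p(-i / n) for i in range(1, b)))
```

Since $\log(1 - i/n) \approx -i/n$, the log of the product is roughly $-b^2/(2n)$, so the probability is close to one exactly when $b^2/n$ is small. For example, with $n = 10^6$ and $b = 10$ it is about 0.99996, while with $n = 365$ and $b = 23$ (the classical birthday problem) it is about 0.493.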

The bootstrap approximation with smaller resample size, $J_b(\hat P_n)$, is further studied
in Bickel, Götze, and van Zwet (1997). In spite of the Corollary, we point
out that $L_{n,b}$ is more generally valid. Indeed, without the assumption $b^2/n \to 0$,
$J_b(x, \hat P_n)$ can be inconsistent. To see why, let $P$ be any distribution on the real
line with a density (with respect to Lebesgue measure). Consider any statistic $\hat\theta_n$,
$\tau_n$, and $\theta(P)$ satisfying Assumption 15.7.1. Even the sample mean will work here.
Now, modify $\hat\theta_n$ to $\tilde\theta_n$ so that the statistic $\tilde\theta_n(X_1, \ldots, X_n)$ completely misbehaves
if any pair of the observations $X_1, \ldots, X_n$ are identical. The bootstrap
approximation to the distribution of $\tilde\theta_n$ must then misbehave as well unless $b^2/n \to 0$,
while the consistency of $L_{n,b}$ remains intact.

The above example, though artificial, was designed to illustrate a point. We 
now consider some further examples. 

Example 15.7.1 (U-statistics of Degree 2) Let $X_1, \ldots, X_n$ be i.i.d. on the
line with c.d.f. $F$. Denote by $\hat F_n$ the empirical distribution of the data. Let

$$\theta(F) = \int\!\!\int \omega(x, y)\, dF(x)\, dF(y)$$

and assume $\omega(x, y) = \omega(y, x)$. Assume $\int\!\!\int \omega^2(x, y)\, dF(x)\, dF(y) < \infty$. Set $\tau_n = n^{1/2}$
and $\hat\theta_n = \sum_{i<j} \omega(X_i, X_j)/\binom{n}{2}$. Then, it is well known that $J_n(F)$ converges
weakly to $J(F)$, the normal distribution with mean 0 and variance given by

$$v^2(F) = 4\Big\{\int \Big[\int \omega(x, y)\, dF(y)\Big]^2 dF(x) - \theta^2(F)\Big\} .$$

Hence, Assumption 15.7.1 holds. However, in order for the bootstrap to succeed,
the additional condition $\int \omega^2(x, x)\, dF(x) < \infty$ is required. Bickel and Freedman
(1981) give a counterexample to show the inconsistency of the bootstrap without
this additional condition.

Interestingly, the bootstrap may fail even if $\int \omega^2(x, x)\, dF(x) < \infty$, stemming
from the possibility that $v^2(F) = 0$. (Otherwise, Bickel and Freedman’s argument
justifies the bootstrap.) As an example, let $\omega(x, y) = xy$. In this case, $\theta(\hat F_n) =
\bar X_n^2 - S_n^2/n$, where $S_n^2$ is the usual unbiased sample variance. If $\theta(F) = 0$, then
$v(F) = 0$. Then, $n[\theta(\hat F_n) - \theta(F)]$ converges weakly to $\sigma^2(F)(Z^2 - 1)$, where $Z$
denotes a standard normal random variable and $\sigma^2(F)$ denotes the variance of $F$.
However, it is easy to see that the bootstrap approximation to the distribution
of $n[\theta(\hat F_n) - \theta(F)]$ has a representation $\sigma^2(F)Z^2 + 2Z\sigma(F)n^{1/2}\bar X_n$. Thus, failure
of the bootstrap follows.

In the context of U-statistics, the possibility of using a reduced sample size
in the resampling has been considered in Bretagnolle (1983); an alternative
correction is given by Arcones (1991). ■
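The closed form used in Example 15.7.1 for the kernel $\omega(x, y) = xy$ can be verified directly. The sketch below computes a degree-2 U-statistic by brute force over all pairs, so that the result can be compared with $\bar X_n^2 - S_n^2/n$; the function name is hypothetical and the brute-force loop is intended only for small samples.

```python
import numpy as np
from itertools import combinations


def u_statistic(x, kernel):
    """Average of kernel(X_i, X_j) over all pairs i < j."""
    pairs = list(combinations(x, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)
```

For a data vector `x`, the value `u_statistic(x, lambda a, b: a * b)` agrees with `x.mean() ** 2 - x.var(ddof=1) / len(x)` up to rounding error, which is the algebraic identity behind the limit $\sigma^2(F)(Z^2 - 1)$.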

Example 15.7.2 (Extreme Order Statistic) The following counterexample
is taken from Bickel and Freedman (1981). If $X_1, \ldots, X_n$ are i.i.d. according to
a uniform distribution on $(0, \theta)$, let $X_{(n)}$ be the maximum order statistic. Then,
$n[X_{(n)} - \theta]$ has a limit distribution given by the distribution of $-\theta X$, where $X$ is
exponential with mean one. Hence, Assumption 15.7.1 is satisfied here. However,
the usual bootstrap fails. To see why, let $X_1^*, \ldots, X_n^*$ be $n$ observations sampled
from the data with replacement, and let $X_{(n)}^*$ be the maximum of the bootstrap
sample. The bootstrap approximation to the distribution of $n[X_{(n)} - \theta]$ is the
distribution of $n[X_{(n)}^* - X_{(n)}]$, conditional on $X_1, \ldots, X_n$. But, the probability
mass at 0 for this bootstrap distribution is the probability that $X_{(n)}^* = X_{(n)}$,
which occurs with probability $1 - (1 - \frac{1}{n})^n \to 1 - \exp(-1)$. However, the true
limiting distribution is continuous. Note in Theorem 15.7.1 that the conditions
on $b$ (with $\tau_n = n$) reduce to $b/n \to 0$ and $b \to \infty$. In this example, at least, it is
clear that we cannot assume $b/n \to c$, where $c > 0$. Indeed, $L_{n,b}(x)$ places mass
$b/n$ at 0. Thus, while it is sometimes true that, under further conditions such as
Wu (1990) assumes, we can take $b$ to be of the same order as $n$, this example
makes it clear that we cannot in general weaken our assumptions on $b$ without
imposing further structure. ■
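
The atom at 0 in Example 15.7.2 can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names, the choice $\theta = 1$, and the simulation sizes are hypothetical and not part of the example. It compares the exact probability $1 - (1 - 1/n)^n$ that a bootstrap sample contains the sample maximum with a Monte Carlo estimate of the mass that the bootstrap distribution of $n[X_{(n)}^* - X_{(n)}]$ places at 0.

```python
import math
import random


def prob_max_in_resample(n: int) -> float:
    """Exact probability that n draws with replacement include the sample maximum."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_mass_at_zero(n: int, reps: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the bootstrap mass at 0 for one uniform(0, 1) sample.

    Assumption for illustration: theta = 1, so the data are uniform on (0, 1).
    """
    rng = random.Random(seed)
    data = [rng.uniform(0.0, 1.0) for _ in range(n)]
    x_max = max(data)
    hits = 0
    for _ in range(reps):
        # Maximum of one bootstrap sample drawn with replacement from the data.
        boot_max = max(rng.choice(data) for _ in range(n))
        if boot_max == x_max:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, prob_max_in_resample(n), 1.0 - math.exp(-1.0))
    print(simulated_mass_at_zero(n=50, reps=4000, seed=1))
```

For moderate $n$ both quantities should be close to $1 - \exp(-1) \approx 0.632$, whereas the limit law of $n[X_{(n)} - \theta]$ has no atom.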

Example 15.7.3 (Superefficient Estimator) Assume $X_1, \ldots, X_n$ are i.i.d.
according to the normal distribution with mean $\theta(P)$ and variance one. Fix $c > 0$.
Let $\hat\theta_n = c \bar X_n$ if $|\bar X_n| < n^{-1/4}$ and $\hat\theta_n = \bar X_n$ otherwise. The resulting estimator
is known as Hodges’ superefficient estimator; see Lehmann and Casella (1998),
p. 440 and Problem 12.66. It is easily checked that $n^{1/2}(\hat\theta_n - \theta(P))$ has a limit
distribution for every $\theta$, so the conditions for our Theorem 15.7.1 remain applicable.
However, Beran (1984) showed that the distribution of $n^{1/2}(\hat\theta_n - \theta(P))$ cannot
be bootstrapped, even if one is willing to apply a parametric bootstrap! ■

We have claimed that subsampling is superior to the bootstrap in a first order
asymptotic sense, since it is more generally valid. However, in many typical
situations, the bootstrap is far superior and has some compelling second-order
asymptotic properties. Some of these were studied in Section 15.5; also see Hall
(1992). In nice situations, such as when the statistic or root is a smooth function
of sample means, a bootstrap approach is often very satisfactory. In other
situations, especially those where it is not known that the bootstrap works even
in a first-order asymptotic sense, subsampling is preferable. Still, in other
situations (such as the mean in the infinite variance case), the bootstrap may work,
but only with a reduced sample size. The issue becomes whether to sample with
or without replacement (as well as the choice of resample size). Although this
question is not yet answered unequivocally, some preliminary evidence in Bickel
et al. (1997) suggests that the bootstrap approximation $J_b(x, \hat P_n)$ might be more
accurate; more details on the issue of higher-order accuracy of the subsampling
approximation $L_{n,b}(x)$ are given in Chapter 10 of Politis, Romano, and Wolf
(1999).

Because $\binom{n}{b}$ can be large, $L_{n,b}$ may be difficult to compute. Instead, an
approximation may be employed. For example, let $I_1, \ldots, I_B$ be chosen randomly with
or without replacement from $\{1, 2, \ldots, N_n\}$. Then, $L_{n,b}(x)$ may be approximated
by

\[ \hat L_{n,b}(x) = \frac{1}{B} \sum_{i=1}^{B} I\{\tau_b(\hat\theta_{n,b,I_i} - \hat\theta_n) \le x\}. \tag{15.71} \]

Corollary 15.7.2 Under the assumptions of Theorem 15.7.1 and the assumption
$B \to \infty$ as $n \to \infty$, the results of Theorem 15.7.1 are valid if $L_{n,b}(x)$ is replaced
by $\hat L_{n,b}(x)$.

Proof. If the $I_i$ are sampled with replacement, $\sup_x |\hat L_{n,b}(x) - L_{n,b}(x)| \to 0$
in probability by the Dvoretzky, Kiefer, Wolfowitz inequality. This result is also
true in the case the $I_i$ are sampled without replacement; apply Proposition 4.1
of Romano (1989b). ■
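
A minimal Python sketch of the approximation (15.71) follows. It assumes, purely for illustration, that the estimator is the sample mean and that $\tau_b = b^{1/2}$; these choices and all function names are hypothetical and not part of the text. Each call to `random.sample` draws one subset of size $b$ uniformly at random, so repeated independent calls correspond to choosing the indices $I_i$ with replacement.

```python
import math
import random
from bisect import bisect_right
from statistics import mean


def approx_subsampling_roots(data, b, B, estimator=mean, tau=math.sqrt, seed=0):
    """Return the sorted values tau_b * (theta_hat_{n,b,I_i} - theta_hat_n), i = 1..B.

    Assumptions for illustration: `estimator` is the sample mean and
    `tau` is the square root, as in regular problems.
    """
    rng = random.Random(seed)
    theta_n = estimator(data)
    roots = []
    for _ in range(B):
        # One uniformly chosen subset of size b (no repeats within a subset).
        subsample = rng.sample(data, b)
        roots.append(tau(b) * (estimator(subsample) - theta_n))
    return sorted(roots)


def ecdf(sorted_values, x):
    """Evaluate the empirical c.d.f. of the sorted values at x, as in (15.71)."""
    return bisect_right(sorted_values, x) / len(sorted_values)


if __name__ == "__main__":
    rng = random.Random(42)
    sample = [rng.gauss(0.0, 1.0) for _ in range(400)]
    roots = approx_subsampling_roots(sample, b=20, B=500, seed=1)
    print(ecdf(roots, 0.0))
```

The cost is $B$ evaluations of the estimator on samples of size $b$, instead of $\binom{n}{b}$ evaluations for the exact $L_{n,b}$.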

An alternative approach, which also requires fewer computations, is the following.
Rather than employing all $\binom{n}{b}$ subsamples of size $b$ from $X_1, \ldots, X_n$, just
use the $n - b + 1$ subsamples of size $b$ of the form $\{X_i, X_{i+1}, \ldots, X_{i+b-1}\}$. Notice
that the ordering of the data is fixed and retained in the subsamples. Indeed,
this is the approach that is applied for time series data; see Chapter 3 of Politis,
Romano and Wolf (1999), where consistency results in data-dependent situations
are given. Even when the i.i.d. assumption seems reasonable, this approach may
be desirable to ensure robustness against possible serial correlation. Most
inferential procedures based on i.i.d. models are simply not valid (i.e., not even first
order accurate) if the independence assumption is violated, so it seems worthwhile
to account for possible dependencies in the data if we do not sacrifice too
much in efficiency.
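
The block scheme just described can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the sample mean with $\tau_b = b^{1/2}$ is assumed as the estimator and normalization.

```python
import math
from statistics import mean


def block_subsamples(data, b):
    """Return the n - b + 1 contiguous blocks (X_i, ..., X_{i+b-1}), in time order."""
    n = len(data)
    if not 1 <= b <= n:
        raise ValueError("block size b must satisfy 1 <= b <= n")
    return [data[i:i + b] for i in range(n - b + 1)]


def block_subsampling_roots(data, b, estimator=mean, tau=math.sqrt):
    """Sorted values tau_b * (estimate on block - estimate on full data).

    Assumptions for illustration: sample mean and square-root normalization.
    """
    theta_n = estimator(data)
    return sorted(tau(b) * (estimator(block) - theta_n)
                  for block in block_subsamples(data, b))


if __name__ == "__main__":
    print(block_subsamples([1, 2, 3, 4, 5], 3))
```

Only $n - b + 1$ evaluations of the estimator are needed, and the time order within each block is preserved.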


15.7.3 Hypothesis Testing 

In this section, we consider the use of subsampling for the construction of
hypothesis tests. As before, $X_1, \ldots, X_n$ is a sample of $n$ independent and identically
distributed observations taking values in a sample space $S$. The common unknown
distribution generating the data is denoted by $P$. This unknown law $P$ is assumed
to belong to a certain class of laws $\mathbf{P}$. The null hypothesis $H$ asserts $P \in \mathbf{P}_0$,
and the alternative hypothesis $K$ is $P \in \mathbf{P}_1$, where $\mathbf{P}_1 \subset \mathbf{P}$ and $\mathbf{P}_0 \cup \mathbf{P}_1 = \mathbf{P}$.
The goal is to construct an asymptotically valid test based on a given test
statistic,

\[ T_n = \tau_n t_n(X_1, \ldots, X_n), \]

where, as before, $\tau_n$ is a fixed nonrandom normalizing sequence. Let

\[ G_n(x, P) = P\{\tau_n t_n(X_1, \ldots, X_n) \le x\}. \]

We will be assuming that $G_n(\cdot, P)$ converges in distribution, at least for $P \in \mathbf{P}_0$.
Of course, this would imply (as long as $\tau_n \to \infty$) that $t_n(X_1, \ldots, X_n) \to 0$ in
probability for $P \in \mathbf{P}_0$. Naturally, $t_n$ should somehow be designed to distinguish
between the competing hypotheses. The theorem we will present will assume $t_n$ is
constructed to satisfy the following: $t_n(X_1, \ldots, X_n) \to t(P)$ in probability, where
$t(P)$ is a constant which satisfies $t(P) = 0$ if $P \in \mathbf{P}_0$ and $t(P) > 0$ if $P \in \mathbf{P}_1$.
This assumption easily holds in typical examples.

To describe the test construction, as in Subsection 15.7.1, let $Y_1, \ldots, Y_{N_n}$ be
equal to the $N_n = \binom{n}{b}$ subsets of size $b$ of $\{X_1, \ldots, X_n\}$, ordered in any fashion. Let $t_{n,b,i}$
be equal to the statistic $t_b$ evaluated at the data set $Y_i$. The sampling distribution
of $T_n$ is then approximated by

\[ G_{n,b}(x) = N_n^{-1} \sum_{i=1}^{N_n} I\{\tau_b t_{n,b,i} \le x\}. \tag{15.72} \]

Using this estimated sampling distribution, the critical value for the test is
obtained as the $1 - \alpha$ quantile of $G_{n,b}(\cdot)$; specifically, define

\[ g_{n,b}(1 - \alpha) = \inf\{x : G_{n,b}(x) \ge 1 - \alpha\}. \tag{15.73} \]

Finally, the nominal level $\alpha$ test rejects $H$ if and only if $T_n > g_{n,b}(1 - \alpha)$.
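
The construction in (15.72) and (15.73) can be sketched in Python as follows. This is a hedged illustration rather than a definitive implementation: the function name, the default $\tau_n = n^{1/2}$, and the example statistic $t_n = \bar X_n - \theta_0$ are assumptions made here. All $\binom{n}{b}$ subsets are enumerated, which is feasible only for small $n$ and $b$.

```python
import math
from itertools import combinations
from statistics import mean


def subsampling_test(data, b, t_stat, alpha=0.05, tau=math.sqrt):
    """Return (T_n, critical value, reject?) for the subsampling test.

    t_stat maps a data set to the unnormalized statistic t_n; tau is the
    normalizing sequence (assumed to be the square root for illustration).
    """
    n = len(data)
    T_n = tau(n) * t_stat(data)
    # tau_b * t_{n,b,i} over all subsets of size b, as in (15.72).
    values = sorted(tau(b) * t_stat(list(s)) for s in combinations(data, b))
    N = len(values)
    # (15.73): smallest x whose empirical c.d.f. value is at least 1 - alpha,
    # i.e. the k-th smallest value with k = ceil((1 - alpha) * N).
    # The small tolerance guards against floating-point rounding in the product.
    k = max(1, math.ceil((1 - alpha) * N - 1e-12))
    crit = values[k - 1]
    return T_n, crit, T_n > crit


if __name__ == "__main__":
    theta0 = 0.0  # hypothetical null value
    sample = [5 + 0.1 * i for i in range(12)]
    print(subsampling_test(sample, b=4, t_stat=lambda xs: mean(xs) - theta0))
```

When full enumeration is too expensive, the same critical value can be computed from a random selection of subsets, in the spirit of Corollary 15.7.2.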

The following theorem gives the asymptotic behavior of this procedure, showing
the test is pointwise consistent in level and pointwise consistent in power.
In addition, an expression for the limiting power of the test is obtained under a
sequence of alternatives contiguous to a distribution in the null hypothesis.

Theorem 15.7.2 Assume $b/n \to 0$ and $b \to \infty$ as $n \to \infty$.

(i) Assume, for $P \in \mathbf{P}_0$, $G_n(P)$ converges weakly to a continuous limit law
$G(P)$, whose corresponding cumulative distribution function is $G(\cdot, P)$ and
whose $1 - \alpha$ quantile is $g(1 - \alpha, P)$. If $G(\cdot, P)$ is continuous at $g(1 - \alpha, P)$
and $P \in \mathbf{P}_0$, then

\[ g_{n,b}(1 - \alpha) \to g(1 - \alpha, P) \quad \text{in probability} \]

and

\[ P\{T_n > g_{n,b}(1 - \alpha)\} \to \alpha \quad \text{as } n \to \infty. \]

(ii) Assume the test statistic is constructed so that $t_n(X_1, \ldots, X_n) \to t(P)$ in
probability, where $t(P)$ is a constant which satisfies $t(P) = 0$ if $P \in \mathbf{P}_0$
and $t(P) > 0$ if $P \in \mathbf{P}_1$. Assume $\liminf_n (\tau_n/\tau_b) > 1$. Then, if $P \in \mathbf{P}_1$,
the rejection probability satisfies

\[ P\{T_n > g_{n,b}(1 - \alpha)\} \to 1 \quad \text{as } n \to \infty. \]

(iii) Suppose $P_n$ is a sequence of alternatives such that, for some $P_0 \in \mathbf{P}_0$, $\{P_n^n\}$
is contiguous to $\{P_0^n\}$. Then,

\[ g_{n,b}(1 - \alpha) \to g(1 - \alpha, P_0) \quad \text{in } P_n^n\text{-probability}. \]

Hence, if $T_n$ converges in distribution to $T$ under $P_n$ and $G(\cdot, P_0)$ is
continuous at $g(1 - \alpha, P_0)$, then

\[ P_n^n\{T_n > g_{n,b}(1 - \alpha)\} \to \mathrm{Prob}\{T > g(1 - \alpha, P_0)\}. \]

The proof is similar to that of Theorem 15.7.1 (Problem 15.52). 

Example 15.7.4 Consider the special case of testing a real-valued parameter.
Specifically, suppose $\theta(\cdot)$ is a real-valued function from $\mathbf{P}$ to the real line. The
null hypothesis is specified by $\mathbf{P}_0 = \{P : \theta(P) = \theta_0\}$. Assume the alternative is
one-sided and is specified by $\{P : \theta(P) > \theta_0\}$. Suppose we simply take

\[ t_n(X_1, \ldots, X_n) = \hat\theta_n(X_1, \ldots, X_n) - \theta_0. \]

If $\hat\theta_n$ is a consistent estimator of $\theta(P)$, then the hypothesis on $t_n$ in part (ii) of the
theorem is satisfied (just take the absolute value of $t_n$ for a two-sided alternative).
Thus, the hypothesis on $t_n$ in part (ii) of the theorem boils down to verifying a
consistency property and is rather weak, though this assumption can in fact be
weakened further. The convergence hypothesis of part (i) is satisfied by typical
test statistics; in regular situations, $\tau_n = n^{1/2}$. ■

The interpretation of part (iii) of the theorem is the following. Suppose, instead
of using the subsampling construction, one could use the test that rejects when
$T_n > g_n(1 - \alpha, P)$, where $g_n(1 - \alpha, P)$ is the exact $1 - \alpha$ quantile of the true
sampling distribution $G_n(\cdot, P)$. Of course, this test is not available in general
because $P$ is unknown and so is $g_n(1 - \alpha, P)$. Then, the asymptotic power of the
subsampling test against a sequence of contiguous alternatives $\{P_n\}$ to $P$ with
$P$ in $\mathbf{P}_0$ is the same as the asymptotic power of this fictitious test against the
same sequence of alternatives. Hence, to the order considered, there is no loss in
efficiency in terms of power.


15.8 Problems 


Section 15.2 

Problem 15.1 Generalize Theorem 15.2.1 to the case where G is an infinite 
group. 

Problem 15.2 With $\hat p$ defined in (15.5), show that (15.6) holds.

Problem 15.3 (i) Suppose $V_1, \ldots, V_B$ are exchangeable real-valued random
variables; that is, their joint distribution is invariant under permutations. Let
$\tilde q$ be defined by

\[ \tilde q = \frac{1}{B} \left[ 1 + \sum_{i=1}^{B-1} I\{V_i \ge V_B\} \right]. \]

Show, $P\{\tilde q \le u\} \le u$ for all $0 \le u \le 1$. Hint: Condition on the order statistics.

(ii) With $\tilde p$ defined in (15.7), show that (15.8) holds.

(iii) How would you construct a p-value based on sampling without replacement
from $G$?

Problem 15.4 With $\hat p$ and $\tilde p$ defined in (15.5) and (15.7), respectively, show
that $\hat p - \tilde p \to 0$ in probability.

Problem 15.5 As an approximation to (15.9), let $g_1, \ldots, g_{B-1}$ be i.i.d. and
uniform on $G$. Also, set $g_B$ to be the identity. Define

\[ R_{n,B}(t) = \frac{1}{B} \sum_{i=1}^{B} I\{T_n(g_i X) \le t\}. \]

Show, conditional on $X$,

\[ \sup_t |R_{n,B}(t) - R_n(t)| \to 0 \]

in probability as $B \to \infty$, and so

\[ \sup_t |R_{n,B}(t) - R_n(t)| \to 0 \]

in probability (unconditionally) as well. Do these results hold only under the null
hypothesis? Hint: Apply Theorem 11.2.18. For a similar result based on sampling
without replacement, see Romano (1989b).

Problem 15.6 Suppose $X_1, \ldots, X_n$ are i.i.d. according to a q.m.d. location
model with finite variance. Show the ARE of the one-sample t-test with respect
to the randomization t-test (based on sign changes) is 1 (even if the underlying
density is not normal).

Problem 15.7 In Theorem 15.2.4, show the conclusion may fail if $\psi_P$ is not an
odd function.

Problem 15.8 Verify (15.15) and (15.16). Hint: Let $S$ be the number of positive
integers $i \le m$ with $W_i = 1$, and condition on $S$.

Problem 15.9 Provide the remaining details for the proof of Theorem 15.2.5. 

Problem 15.10 In the two-sample problem of Example 15.2.6, suppose the
underlying distributions are normal with common variance. For testing $\mu(P_Y) =
\mu(P_Z)$ against $\mu(P_Y) > \mu(P_Z)$ compute the limiting power of the randomization
test based on the test statistic $T_{m,n}$ given in (15.13) against contiguous
alternatives of the form $\mu(P_Y) = \mu(P_Z) + h m^{-1/2}$. Show this is the same as the optimal
two-sample t-test. Argue that the two tests are asymptotically equivalent in the
sense of Problem 13.24.

Problem 15.11 Using Theorem 15.2.3, prove a result analogous to Theorem
15.2.5 with $T_{m,n}$ replaced by $\tilde T_{m,n}$ defined in (15.19). Deduce that the two-sample
permutation test is consistent in level for testing equality of population means, as
long as the underlying populations have a finite variance. [This result was proved
in Janssen (1997) by an alternative method.]

Problem 15.12 Under the setting of Problem 11.52 for testing equality of Poisson
means $\lambda_i$ based on the test statistic $T$, show how to construct a randomization
test based on $T$. Examine the limiting behavior of the randomization distribution
under the null hypothesis and contiguous alternatives.

Problem 15.13 Suppose $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. bivariate observations
in the plane, and let $\rho$ denote the correlation between $X_1$ and $Y_1$. Let $\hat\rho_n$ be the
sample correlation

\[ \hat\rho_n = \frac{\sum_i (X_i - \bar X_n)(Y_i - \bar Y_n)}{\sqrt{\sum_i (X_i - \bar X_n)^2 \sum_j (Y_j - \bar Y_n)^2}}. \]

(i) For testing independence of $X_i$ and $Y_i$, construct a randomization test based
on the test statistic $T_n = n^{1/2} |\hat\rho_n|$.

(ii) For testing $\rho = 0$ versus $\rho > 0$ based on the test statistic $\hat\rho_n$, determine the
limit behavior of the randomization distribution when the underlying population
is bivariate Gaussian with correlation $\rho = 0$. Determine the limiting power of the
randomization test under local alternatives $\rho = h n^{-1/2}$. Argue that the
randomization test and the optimal UMPU test (5.75) are asymptotically equivalent in
the sense of Problem 13.24.

(iii) Investigate what happens if the underlying distribution has correlation 0,
but $X_i$ and $Y_i$ are dependent.


Section 15.3 

Problem 15.14 Assume $X_1, \ldots, X_n$ are i.i.d. according to a location scale
model with distribution of the form $F[(x - \theta)/\sigma]$, where $F$ is known, $\theta$ is a
location parameter, and $\sigma$ is a scale parameter. Suppose $\hat\theta_n$ is a location and scale
equivariant estimator and $\hat\sigma_n$ is a location invariant, scale equivariant estimator.
Then, show that the roots $[\hat\theta_n - \theta]/\hat\sigma_n$ and $\hat\sigma_n/\sigma$ are pivots.

Problem 15.15 Let $X = (X_1, \ldots, X_n)^T$ and consider the linear model

\[ X_i = \sum_{j=1}^{s} a_{i,j} \beta_j + \sigma \epsilon_i, \]

where the $\epsilon_i$ are i.i.d. $F$, where $F$ has mean 0 and variance 1. Here, the $a_{i,j}$ are
known, $\beta = (\beta_1, \ldots, \beta_s)^T$ and $\sigma$ are unknown. Let $A$ be the $n \times s$ matrix with $(i, j)$
entry $a_{i,j}$ and assume $A$ has rank $s$. As in Section 11.3.3, let $\hat\beta_n = (A^T A)^{-1} A^T X$
be the least squares estimate of $\beta$. Consider the test statistic

\[ T_n = \frac{(n - s)(\hat\beta_n - \beta)^T (A^T A)(\hat\beta_n - \beta)}{s S_n^2}, \]

where $S_n^2 = (X - A\hat\beta_n)^T (X - A\hat\beta_n)/(n - s)$. Is $T_n$ a pivot when $F$ is known?

Section 15.4 

Problem 15.16 Suppose the convergences (15.23) and (15.24) only hold in 
probability. Show that (15.25) still holds. 

Problem 15.17 In Theorem 15.4.1, one cannot deduce the uniform convergence
result (15.23) without the assumption that the limit law $J(P)$ is continuous. Show
that, without the continuity assumption for $J(P)$,

\[ \rho_L(J_n(\hat P_n), J_n(P)) \to 0 \]

with probability one, where $\rho_L$ is the Lévy metric defined in Definition 11.2.3.

Problem 15.18 In Theorem 15.4.3 (i), show that the assumption that $\theta(F_n) \to
\theta(F)$ actually follows from the other assumptions.

Problem 15.19 Reprove Theorem 15.4.3 under the assumption $E(|X_1|^3) < \infty$
by using the Berry-Esseen Theorem.

Problem 15.20 Prove the following extension of Theorem 15.4.3 holds. Let $D_F$
be the set of sequences $\{F_n\}$ such that $F_n$ converges weakly to a distribution $G$
and $\sigma^2(F_n) \to \sigma^2(G) = \sigma^2(F)$. Then, Theorem 15.4.3 holds with $C_F$ replaced
by $D_F$. (Actually, one really only needs to define $D_F$ so that any sequence $\{F_n\}$
is tight and any weakly convergent subsequence of $\{F_n\}$ has the above property.)
Thus, the possible choices for the resampling distribution are quite large in the
sense that the bootstrap approximation $J_n(\hat G_n)$ can be consistent even if $\hat G_n$ is
not at all close to $F$. For example, the choice where $\hat G_n$ is normal with mean
$\bar X_n$ and variance equal to a consistent estimate of the sample variance results
in consistency. Therefore, the normal approximation can in fact be viewed as a
bootstrap procedure with a perverse choice of resampling distribution. Show the
bootstrap can be inconsistent if $\sigma^2(G) \ne \sigma^2(F)$.

Problem 15.21 In the case that $\theta(P)$ is real-valued, Efron initially proposed
the following construction, called the bootstrap percentile method. Let $\hat\theta_n$ be an
estimator of $\theta(P)$, and let $\tilde J_n(P)$ be the distribution of $\hat\theta_n$ under $P$. Then, Efron’s
two-sided percentile interval of nominal level $1 - \alpha$ takes the form

\[ \left[ \tilde J_n^{-1}\left(\tfrac{\alpha}{2}, \hat P_n\right),\ \tilde J_n^{-1}\left(1 - \tfrac{\alpha}{2}, \hat P_n\right) \right]. \tag{15.74} \]

Also, consider the root $R_n(X^n, \theta(P)) = n^{1/2}(\hat\theta_n - \theta(P))$, with distribution $J_n(P)$.
Write (15.74) as a function of $\hat\theta_n$ and the quantiles of $J_n(\hat P_n)$. Suppose Theorem
15.4.1 holds for the root $R_n$, so that $J_n(P)$ converges weakly to $J(P)$. What must
be assumed about $J(P)$ so that $P\{\theta(P) \in I_n\} \to 1 - \alpha$?

Problem 15.22 Let $\hat\theta_n$ be an estimate of a real-valued parameter $\theta(P)$. Suppose
there exists an increasing transformation $g$ such that

\[ g(\hat\theta_n) - g(\theta(P)) \]

is a pivot, so that its distribution does not depend on $P$. Also, assume this
distribution is continuous, strictly increasing and symmetric about zero.

(i) Show that Efron’s percentile interval (15.74), which may be constructed
without knowledge of $g$, has exact coverage $1 - \alpha$.

(ii) Show that the percentile interval is transformation equivariant. That is, if
$\phi = m(\theta)$ is a monotone transformation of $\theta$, then the percentile interval for $\phi$ is
the percentile interval for $\theta$ transformed by $m$, at least if $\hat\phi_n$ is taken to be $m(\hat\theta_n)$.
This holds true for the theoretical percentile interval as well as its approximation
due to simulation.

(iii) If the parameter $\theta$ only takes values in an interval $I$ and $\hat\theta_n$ does as well,
then the percentile interval is range-preserving in the sense that the interval is
always a subset of $I$.

Problem 15.23 Suppose $\hat\theta_n$ is an estimate of some real-valued parameter $\theta(P)$.
Let $H_n(x, \theta)$ denote the c.d.f. of $\hat\theta_n$ under $\theta$, with inverse $H_n^{-1}(1 - \alpha, \theta)$. The
percentile interval lower confidence bound of level $1 - \alpha$ is then $H_n^{-1}(\alpha, \hat\theta_n)$.
Suppose that, for some increasing transformation $g$, and constants $z_0$ (called the
bias correction) and $a$ (called the acceleration constant),

\[ P\left\{ \frac{g(\hat\theta_n) - g(\theta)}{1 + a g(\theta)} + z_0 \le x \right\} = \Phi(x), \tag{15.75} \]

where $\Phi$ is the standard normal c.d.f.

(i) Letting $\hat\phi_n = g(\hat\theta_n)$, show that $\hat\theta_{n,L}$ given by

\[ \hat\theta_{n,L} = g^{-1}\left\{ \hat\phi_n + (z_\alpha + z_0)(1 + a \hat\phi_n)/[1 - a(z_\alpha + z_0)] \right\} \]

is an exact $1 - \alpha$ lower confidence bound for $\theta$.

(ii) Because $\hat\theta_{n,L}$ requires knowledge of $g$, let

\[ \hat\theta_{n,BC_a} = H_n^{-1}(\beta, \hat\theta_n), \]

where

\[ \beta = \Phi\left( z_0 + (z_\alpha + z_0)/[1 - a(z_\alpha + z_0)] \right). \]

Show that $\hat\theta_{n,BC_a} = \hat\theta_{n,L}$. [The lower bound $\hat\theta_{n,BC_a}$ is called the $BC_a$ lower
bound and Efron shows one may take $z_0 = \Phi^{-1}(G_n(\hat\theta_n, \hat\theta_n))$ and gives methods
to estimate $a$; see Efron and Tibshirani (1993, Chapter 14).]

Problem 15.24 Assume the setup of Problem 15.23 and condition (15.75). Let
$\theta_0$ be any value of $\theta$ and let $\theta_1 = G_n^{-1}(1 - \alpha, \theta_0)$. Let

\[ \hat\theta_{n,AP} = G_n^{-1}(\beta', \hat\theta_n), \]

where

\[ \beta' = G_n(\theta_0, \theta_1). \]

Show that $\hat\theta_{n,AP}$ is an exact level $1 - \alpha$ lower confidence bound for $\theta$. [This
is called the automatic percentile lower bound of DiCiccio and Romano (1989),
and may be computed without knowledge of $g$, $a$ or $z_0$. Its exactness holds under
assumptions even weaker than (15.75).]

Problem 15.25 Let $X_1, \ldots, X_{n_X}$ be i.i.d. with distribution $F_X$, and let
$Y_1, \ldots, Y_{n_Y}$ be i.i.d. with distribution $F_Y$. The two samples are independent.
Let $\mu(F)$ denote the mean of a distribution $F$, and let $\sigma^2(F)$ denote the
variance of $F$. Assume $\sigma^2(F_X)$ and $\sigma^2(F_Y)$ are finite. Suppose we are interested in
$\theta = \theta(F_X, F_Y) = \mu(F_X) - \mu(F_Y)$. Construct a bootstrap confidence interval for $\theta$
of nominal level $1 - \alpha$, and prove that it asymptotically has the correct coverage
probability.

Problem 15.26 Let $X_1, \ldots, X_n$ be i.i.d. Bernoulli trials with success probability
$\theta$.

(i) As explicitly as possible, find a uniformly most accurate upper confidence
bound for $\theta$ of nominal level $1 - \alpha$. State the bound explicitly in the case $X_i = 0$
for every $i$.

(ii) Describe a bootstrap procedure to obtain an upper confidence bound for $\theta$
of nominal level $1 - \alpha$. What does it reduce to for the previous data set?

(iii) Let $B_{1-\alpha}$ denote your upper bootstrap confidence bound for $\theta$. Then,
$P_\theta(\theta \le B_{1-\alpha}) \to 1 - \alpha$ as $n \to \infty$. Prove the following.

\[ \sup_\theta |P_\theta(\theta \le B_{1-\alpha}) - (1 - \alpha)| \]

does not tend to 0 as $n \to \infty$.

Problem 15.27 Let $X_1, \ldots, X_n$ be i.i.d. with c.d.f. $F$, mean $\mu(F)$ and finite
variance $\sigma^2(F)$. Consider the root $R_n = n^{1/2}(\bar X_n^2 - \mu^2(F))$ and the bootstrap
approximation to its distribution $J_n(\hat F_n)$, where $\hat F_n$ is the empirical c.d.f.
Determine the asymptotic behavior of $J_n(\hat F_n)$. Hint: Distinguish the cases $\mu(F) = 0$
and $\mu(F) \ne 0$.

Problem 15.28 Show why (15.43) is true. 

Problem 15.29 (i) Under the setup of Example 15.4.6, prove that Theorem
15.4.7 applies if studentized statistics are used.

(ii) In addition to the $X_1, \ldots, X_n$, suppose i.i.d. $Y_1, \ldots, Y_{n'}$ are observed, with
$Y_i = (Y_{i,1}, \ldots, Y_{i,s})$. The distribution of $Y_i$ need not be that of $X_i$. Suppose
the mean of $Y_i$ is $(\mu'_1, \ldots, \mu'_s)$. Generalize Example 15.4.6 to simultaneously test
$H_i : \mu_i = \mu'_i$. Distinguish between two cases, first where the $X_i$ are independent
of the $Y_j$, and next where $(X_i, Y_i)$ are paired (so $n = n'$) and $X_i$ need not be
independent of $Y_i$.

Problem 15.30 Under the setup of Example 15.4.7, provide the details to show
that the FWER is asymptotically controlled.

Problem 15.31 Under the setup of Example 15.4.7, suppose that there is also
an i.i.d. control sample $X_{0,1}, \ldots, X_{0,n_0}$, independent of the other $X$s. Let $\mu_0$
denote the mean of the controls. Now consider testing $H_i : \mu_i = \mu_0$. Describe a
method that asymptotically controls the FWER.

Problem 15.32 Under the setup of Example 15.4.7, let $F_i$ denote the
distribution of the $i$th sample. Now, consider $H'_{i,j} : F_i = F_j$ based on the same test
statistics. Describe a randomization test that has exact control of the FWER.
[Hint: Recall Theorem 9.1.3(ii).]

Problem 15.33 Let $\epsilon_1, \epsilon_2, \ldots$ be i.i.d. $N(0, 1)$. Let $X_i = \mu + \epsilon_i + \beta \epsilon_{i+1}$ with $\beta$ a
fixed nonzero constant. The $X_i$ form a moving average process studied in Section
11.3.1.

(i) Examine the behavior of the nonparametric bootstrap method for estimating
the mean using the root $n^{1/2}(\bar X_n - \mu)$ and resampling from the empirical
distribution. Show that the coverage probability does not tend to the nominal level
under such a moving average process.

(ii) Suppose $n = bk$ for integers $b$ and $k$. Consider the following moving blocks
bootstrap resampling scheme. Let $L_{i,b} = (X_i, X_{i+1}, \ldots, X_{i+b-1})$ be the block of
$b$ observations beginning at “time” $i$. Let $X_1^*, \ldots, X_n^*$ be obtained by randomly
choosing with replacement $k$ of the $n - b + 1$ blocks $L_{i,b}$; that is, $X_1^*, \ldots, X_b^*$ are
the observations in the first sampled block, $X_{b+1}^*, \ldots, X_{2b}^*$ are the observations
from the second sampled block, etc. Then, the distribution of $n^{1/2}[\bar X_n - \mu]$ is
approximated by the moving blocks bootstrap distribution given by the
distribution of $n^{1/2}[\bar X_n^* - \bar X_n]$, where $\bar X_n^* = \sum_{i=1}^{n} X_i^*/n$. If $b$ is fixed, determine the mean
and variance of this distribution as $n \to \infty$. Now let $b \to \infty$ as $n \to \infty$. At
what rate should $b \to \infty$ so that the mean and variance of the moving blocks
distribution tends to the same limiting values as the true mean and variance, at
least in probability? [The moving blocks bootstrap was independently discovered
by Künsch (1989) and Liu and Singh (1992). The stationary bootstrap of Politis
and Romano (1994a) and other methods designed for dependent data are studied
in Lahiri (2003).]
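
The resampling scheme in part (ii) can be sketched in Python. The sketch only generates moving blocks resamples and the corresponding roots; it does not address the questions posed in the problem. The function names are hypothetical, and $n = bk$ is assumed as in the statement.

```python
import random
from statistics import mean


def moving_blocks_resample(data, b, rng):
    """Draw k = n / b blocks L_{i,b} with replacement and concatenate them."""
    n = len(data)
    if b < 1 or n % b != 0:
        raise ValueError("this sketch assumes n = b * k for integers b and k")
    k = n // b
    resample = []
    for _ in range(k):
        start = rng.randrange(n - b + 1)  # one of the n - b + 1 block starts
        resample.extend(data[start:start + b])
    return resample


def moving_blocks_roots(data, b, reps, seed=0):
    """Simulated values of n^{1/2} * (resample mean - sample mean)."""
    rng = random.Random(seed)
    n = len(data)
    xbar = mean(data)
    return [n ** 0.5 * (mean(moving_blocks_resample(data, b, rng)) - xbar)
            for _ in range(reps)]


if __name__ == "__main__":
    series = [float(i % 5) for i in range(60)]
    print(moving_blocks_roots(series, b=6, reps=5, seed=3))
```

Taking $b = 1$ reduces the scheme to the ordinary nonparametric bootstrap of part (i).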


Section 15.5 

Problem 15.34 Under the assumptions of Theorem 15.5.2, show that, for any
$\epsilon > 0$, the expansion (15.47) holds uniformly in $\alpha \in [\epsilon, 1 - \epsilon]$.

Problem 15.35 Under the assumptions of Theorem 15.5.1, show that, for any
$\epsilon > 0$, the expansion (15.48) holds uniformly in $\alpha \in [\epsilon, 1 - \epsilon]$.

Problem 15.36 Suppose $Y_n$ is a sequence of random variables satisfying

\[ P\{Y_n \le t\} = g_0(t) + g_1(t) n^{-1/2} + O(n^{-1}), \]

uniformly in $t$, where $g_0$ and $g_1$ have uniformly bounded derivatives. If $T_n =
O_P(n^{-1})$, then show, for any fixed (nonrandom) sequence $t_n$,

\[ P\{Y_n \le t_n + T_n\} = g_0(t_n) + g_1(t_n) n^{-1/2} + O(n^{-1}). \]

Problem 15.37 Assuming the expansions in the section hold, show that the
two-sided bootstrap interval (15.56) has coverage error of order $n^{-1}$.

Problem 15.38 Assuming the expansions in the section hold, show that the
two-sided bootstrap-t interval (15.62) has coverage error of order $n^{-1}$.

Problem 15.39 Verify the expansion (15.63) and argue that the resulting
interval $I_n(1 - \hat\alpha_n)$ has coverage error $O(n^{-1})$.

Problem 15.40 In the nonparametric mean setting, determine the one- and
two-sided coverage errors of Efron’s percentile method described in (15.74).

Problem 15.41 Assume $F$ has infinitely many moments and is absolutely
continuous. Under the notation of this section, argue that $n^{1/2}[J_n(t, \hat F_n) - J_n(t, F)]$
has an asymptotically normal limiting distribution, as does $n[K_n(t, \hat F_n) -
K_n(t, F)]$.


Problem 15.42 (i) In a normal location model $N(\mu, \sigma^2)$, consider the root $R_n =
n^{1/2}(\bar X_n - \mu)$, which is not a pivot. Show that bootstrap calibration, by parametric
resampling, produces an exact interval.

(ii) Next, consider the root $n^{1/2}(S_n^2 - \sigma^2)$, where $S_n^2$ is the usual unbiased
estimate of variance. Show that bootstrap calibration, by parametric resampling,
produces an exact interval.


Problem 15.43 (i) Show the bootstrap interval (15.22) can be written as

\[ \{\theta \in \Theta : J_n(R_n(X^n, \theta), \hat P_n) \le 1 - \alpha\} \tag{15.76} \]

if, for the purposes of this problem, $J_n(x, P)$ is defined as the left continuous
c.d.f.

\[ J_n(x, P) = P\{R_n(X^n, \theta(P)) < x\} \]

and $J_n^{-1}(1 - \alpha, P)$ is now defined as

\[ J_n^{-1}(1 - \alpha, P) = \sup\{x : J_n(x, P) \le 1 - \alpha\}. \]

[Hint: If a random variable $Y$ has left continuous c.d.f. $F(x) = P\{Y < x\}$ and
$F^{-1}(1 - \alpha)$ is the largest $1 - \alpha$ quantile of $F$, then the event $\{X \le F^{-1}(1 - \alpha)\}$
is identical to $\{F(X) \le 1 - \alpha\}$ for any random variable $X$ (which need not have
distribution $F$). Why?]

(ii) The bootstrap interval (15.76) pretends that

\[ R_{n,1}(X^n, \theta(P)) = J_n(R_n(X^n, \theta(P)), \hat P_n) \]

has the uniform distribution on $(0, 1)$. Let $J_{n,1}(P)$ be the actual distribution
of $R_{n,1}(X^n, \theta(P))$ under $P$, with left continuous c.d.f. denoted $J_{n,1}(x, P)$. This
results in a new interval with $R_n$ and $J_n$ replaced by $R_{n,1}$ and $J_{n,1}$ in (15.76).
Show that the resulting interval is equivalent to bootstrap calibration of the
initial interval. [The mapping of $R_n$ into $R_{n,1}$ by estimated c.d.f. of the former
is called prepivoting. Beran (1987, 1988b) argues that the interval based on $R_{n,1}$
has better coverage properties than the interval based on $R_n$.]


Section 15.6 

Problem 15.44 In Example 15.6.1, rather than exact evaluation of $G_n(\cdot, \hat Q_n)$,
describe a simulation test of $H$ that has exact level $\alpha$.

Problem 15.45 In Example 15.6.2, why is the parametric bootstrap test exact
for the special case of Example 12.4.7?

Problem 15.46 In the Behrens-Fisher problem, show that (15.64) and (15.65)
hold.

Problem 15.47 In the Behrens-Fisher problem, verify the bootstrap-t has
rejection probability equal to $\alpha + O(n^{-2})$.

Problem 15.48 In the Behrens-Fisher problem, what is the order of error in
rejection probability for the likelihood ratio test? What is the order of error in
rejection probability if you bootstrap the non-studentized statistic
$n^{1/2}(\bar X_{n,1} - \bar X_{n,2})$?

Problem 15.49 In Example 15.6.4, with resampling from the empirical
distribution shifted to have mean 0, what are the errors in rejection for the tests based
on $T_n$ and $T'_n$? How do these tests differ from the corresponding tests obtained
through inverting bootstrap confidence bounds?

Problem 15.50 Let $X_1, \ldots, X_n$ be i.i.d. with a distribution $P$ on the real
line, and let $\hat P_n$ be the empirical distribution function. Find $Q$ that minimizes
$\delta_{KL}(\hat P_n, Q)$, where $\delta_{KL}$ is the Kullback-Leibler divergence defined by (15.66).

Problem 15.51 Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued with c.d.f. $F$. The
problem is to test the null hypothesis that $F$ is $N(\mu, \sigma^2)$ for some $(\mu, \sigma^2)$. Consider
the test statistic

\[ T_n = n^{1/2} \sup_t |\hat F_n(t) - \Phi((t - \bar X_n)/\hat\sigma_n)|, \]

where $\hat F_n$ is the empirical c.d.f. and $(\bar X_n, \hat\sigma_n^2)$ is the MLE for $(\mu, \sigma^2)$ assuming
normality. Argue that the distribution of $T_n$ does not depend on $(\mu, \sigma^2)$ and
describe an exact bootstrap test construction. [Such problems are studied in
Romano (1988).]

Section 15.7

Problem 15.52 Prove Theorem 15.7.2. [Hint: For (ii), rather than considering
$G_{n,b}(x)$, just look at the empirical distribution of the values of $t_{n,b,i}$ (not scaled
by $\tau_b$) and show $G^0_{n,b}(\cdot)$ converges in distribution to a point mass at $t(P)$.]

Problem 15.53 Prove a result for subsampling analogous to Theorem 15.4.7,
but that does not require assumption (15.42). [Theorem 15.4.7 applies to testing
real-valued parameters; a more general multiple testing procedure based on
subsampling is given by Theorem 4.4 of Romano and Wolf (2004).]

Problem 15.54 To see how subsampling extends to a dependent time series
model, assume $X_1, \ldots, X_n$ are sampled from a stationary time series model that
is $m$-dependent. [Stationarity means the distribution of the $X_1, X_2, \ldots$ is the
same as that of $X_t, X_{t+1}, \ldots$ for any $t$. The process is $m$-dependent if, for any
$t$ and $m$, $(X_1, \ldots, X_t)$ and $(X_{t+m+1}, X_{t+m+2}, \ldots)$ are independent; that is,
observations separated in time by more than $m$ units are independent.] Suppose
the sum in the definition (15.67) of $L_{n,b}$ extends only over the $n - b + 1$
subsamples of size $b$ of the form $(X_i, X_{i+1}, \ldots, X_{i+b-1})$; call the resulting estimate
$\tilde L_{n,b}$. Under the assumption of stationarity and $m$-dependence, prove a theorem
analogous to Theorem 15.7.1. [The theorem can be extended to much weaker
types of dependence; see Politis, Romano, and Wolf (1999).]


15.9 Notes 

Early references to permutation tests were provided at the end of Chapter 5. An
elementary account is provided by Good (1994), who provides an extensive
bibliography, and Edgington (1995). Multivariate permutation tests are developed
in Pesarin (2001). The present large sample approach is due to Hoeffding (1952).
Applications to block experiments are discussed in Robinson (1973). Expansions
for the power of rank and permutation tests in the one- and two-sample problems
are obtained in Albers, Bickel and van Zwet (1976) and Bickel and van Zwet
(1978), respectively. A full account of the large sample theory of rank statistics
is given in Hájek, Šidák, and Sen (1999). Robust two-sample permutation tests
are obtained in Lambert (1985).

The bootstrap was discovered by Efron (1979), who coined the name. Much of
the theoretical foundations of the bootstrap are laid out in Bickel and Freedman
(1981) and Singh (1981). The development in Section 15.4 is based on Beran
(1984). The use of Edgeworth expansions to study the bootstrap was initiated in
Singh (1981) and Babu and Singh (1983), and is used prominently in Hall (1992).
There have since been hundreds of papers on the bootstrap, as well as several
book length treatments, including Hall (1992), Efron and Tibshirani (1993), Shao
and Tu (1995), Davison and Hinkley (1997) and Lahiri (2003). Comparisons of
bootstrap and randomization tests are made in Romano (1989b) and Janssen and
Pauls (2003). Westfall and Young (1993) and van der Laan, Dudoit and Pollard
(2004) apply resampling to multiple testing problems. Theorem 15.4.7 is based
on Romano and Wolf (2004).

The method of empirical likelihood referred to in Example 15.6.4 is fully treated
in Owen (2001). Similar to parametric models, the method of empirical likelihood
can be improved through a Bartlett correction, yielding two-sided tests with
error in rejection probability of $O(n^{-2})$; see DiCiccio, Hall and Romano (1991).
Alternatively, rather than using the asymptotic Chi-squared distribution to get
critical values, a direct bootstrap approach resamples from $\hat Q_n$. Higher order
properties of such procedures are considered in DiCiccio and Romano (1990).

The roots of subsampling can be traced to Quenouille’s (1949) and Tukey’s 
(1958a) jackknife. Hartigan (1969) and Wu (1990) used subsamples to construct 
confidence intervals, but in a very limited setting. A general theory for using 
subsampling to approximate a sampling distribution is presented in Politis and 
Romano (1994b), including i.i.d. and data-dependent settings. A full treatment 
with numerous references is given by Politis, Romano, and Wolf (1999). 



Appendix A

Auxiliary Results


A.1 Equivalence Relations; Groups

A relation: $x \sim y$ among the points of a space $\mathcal{X}$ is an equivalence relation if it
is reflexive, symmetric, and transitive, that is, if

(i) $x \sim x$ for all $x \in \mathcal{X}$;

(ii) $x \sim y$ implies $y \sim x$;

(iii) $x \sim y$, $y \sim z$ implies $x \sim z$.


Example A.1.1 Consider a class of statistical decision procedures as a space,
of which the individual procedures are the points. Then the relation defined by
$\delta \sim \delta'$ if the procedures $\delta$ and $\delta'$ have the same risk function is an equivalence
relation. As another example consider all real-valued functions defined over the
real line as points of a space. Then $f \sim g$ if $f(x) = g(x)$ a.e. is an equivalence
relation.

Given an equivalence relation, let $D_x$ denote the set of points of the space that
are equivalent to $x$. Then $D_x = D_y$ if $x \sim y$, and $D_x \cap D_y = \emptyset$ otherwise. Since by
(i) each point of the space lies in at least one of the sets $D_x$, it follows that these
sets, the equivalence classes defined by the relation $\sim$, constitute a partition of
the space.

A set $G$ of elements is called a group if it satisfies the following conditions.

(i) There is defined an operation, group multiplication, which with any two
elements $a, b \in G$ associates an element $c$ of $G$. The element $c$ is called the
product of $a$ and $b$ and is denoted by $ab$.

(ii) Group multiplication obeys the associative law

$$(ab)c = a(bc).$$

(iii) There exists an element $e \in G$, called the identity, such that

$$ae = ea = a \quad \text{for all } a \in G.$$

(iv) For each element $a \in G$, there exists an element $a^{-1} \in G$, its inverse, such
that

$$a a^{-1} = a^{-1} a = e.$$

Both the identity element and the inverse $a^{-1}$ of any element $a$ can be shown
to be unique.


Example A.1.2 The set of all $n \times n$ orthogonal matrices constitutes a group if
matrix multiplication and inverse are taken as group multiplication and inverse
respectively, and if the identity matrix is taken as the identity element of the
group. With the same specification of the group operations, the class of all
nonsingular $n \times n$ matrices also forms a group. On the other hand, the class of all
$n \times n$ matrices fails to satisfy condition (iv).
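
The claims of Example A.1.2 can be checked numerically in a small case. The following is only an illustrative sketch for $2 \times 2$ matrices; the helper names (`matmul`, `is_orthogonal`, `rotation`) and the particular matrices are assumptions made for the illustration and are not part of the text.

```python
import math

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def is_orthogonal(a, tol=1e-12):
    """True if a times its transpose is the identity, up to rounding error."""
    n = len(a)
    p = matmul(a, transpose(a))
    return all(abs(p[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(n) for j in range(n))

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

reflection = [[1.0, 0.0], [0.0, -1.0]]

# Two orthogonal matrices chosen arbitrarily for the illustration.
a = rotation(0.7)
b = matmul(rotation(-1.3), reflection)

closed = is_orthogonal(matmul(a, b))      # the product is again orthogonal
inverse_ok = is_orthogonal(transpose(a))  # the inverse (the transpose) is orthogonal

# A singular matrix has determinant 0 and hence no inverse: condition (iv) fails.
singular = [[1.0, 2.0], [2.0, 4.0]]
det_singular = singular[0][0] * singular[1][1] - singular[0][1] * singular[1][0]
```

A finite check of this kind does not prove the group property; it only illustrates conditions (i) and (iv) for particular elements.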

If the elements of $G$ are transformations of some space onto itself, with the
group product $ba$ defined as the result of applying first transformation $a$ and
following it by $b$, then $G$ is called a transformation group. Assumption (ii) is then
satisfied automatically. For any transformation group defined over a space $\mathcal{X}$ the
relation between points of $\mathcal{X}$ given by

$$x \sim y \quad \text{if there exists } a \in G \text{ such that } y = ax$$

is an equivalence relation. That it satisfies conditions (i), (ii), and (iii) required
of an equivalence follows respectively from the defining properties (iii), (iv), and
(i) of a group.
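
As a small illustration of how a transformation group partitions a space into equivalence classes, the sketch below takes the cyclic shifts of binary 4-tuples as a hypothetical transformation group (an assumption chosen for the example, not taken from the text) and computes the resulting classes.

```python
from itertools import product

def rotate(k, n):
    """Transformation of n-tuples: cyclic shift by k positions."""
    def g(x):
        return x[k % n:] + x[:k % n]
    return g

n = 4
group = [rotate(k, n) for k in range(n)]   # cyclic group of order 4
space = list(product((0, 1), repeat=n))    # all binary 4-tuples

def orbit(x):
    """Equivalence class of x: all points y = a x with a in the group."""
    return frozenset(g(x) for g in group)

# Distinct equivalence classes; together they partition the space.
orbits = {orbit(x) for x in space}
```

The 16 points fall into 6 classes, each point lying in exactly one class, in agreement with the partition property of equivalence classes.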

Let $\mathcal{C}$ be any class of 1 : 1 transformations of a space, and let $G$ be the class
of all finite products $a_1^{\pm 1} a_2^{\pm 1} \cdots a_m^{\pm 1}$, with $a_1, \ldots, a_m \in \mathcal{C}$, $m = 1, 2, \ldots$, where
each of the exponents can be $+1$ or $-1$ and where the elements $a_1, a_2, \ldots$ need
not be distinct. Then it is easily checked that $G$ is a group, and is in fact the
smallest group containing $\mathcal{C}$.
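
For a finite space the group generated by a class of transformations can be computed by closing the class and its inverses under products. The following sketch does this for permutations of $\{0, 1, 2\}$ written as tuples; the function names and the choice of generators are hypothetical and serve only as an illustration.

```python
def compose(b, a):
    """Product ba: apply permutation a first, then b (a tuple p maps i to p[i])."""
    return tuple(b[a[i]] for i in range(len(a)))

def inverse(a):
    inv = [0] * len(a)
    for i, ai in enumerate(a):
        inv[ai] = i
    return tuple(inv)

def generated_group(generators):
    """All finite products of the generators and their inverses."""
    n = len(generators[0])
    identity = tuple(range(n))              # equals a a^{-1} for any generator a
    letters = set(generators) | {inverse(g) for g in generators}
    group = {identity}
    frontier = [identity]
    while frontier:
        x = frontier.pop()
        for g in letters:
            y = compose(g, x)
            if y not in group:
                group.add(y)
                frontier.append(y)
    return group

# Hypothetical class C: a transposition and a 3-cycle on {0, 1, 2}.
C = [(1, 0, 2), (1, 2, 0)]
G = generated_group(C)
```

Here the two generators yield all six permutations of three points, and the resulting set is closed under products and inverses.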


A.2 Convergence of Functions; Metric Spaces 

When studying convergence properties of functions it is frequently convenient to
consider a class of functions as a realization of an abstract space $\mathcal{F}$ of points $f$
in which convergence of a sequence $f_n$ to a limit $f$, denoted by $f_n \to f$, has been
defined.


Example A.2.1 Let $\mu$ be a measure over a measurable space $(\mathcal{X}, \mathcal{A})$.

(i) Let $\mathcal{F}$ be the class of integrable functions. Then $f_n$ converges to $f$ in the
mean if¹

$$\int |f_n - f| \, d\mu \to 0. \tag{A.1}$$

(ii) Let $\mathcal{F}$ be a uniformly bounded class of measurable functions. The sequence
$f_n$ is said to converge to $f$ weakly if

$$\int f_n p \, d\mu \to \int f p \, d\mu \tag{A.2}$$

for all functions $p$ that are integrable $\mu$.

(iii) Let $\mathcal{F}$ be the class of measurable functions. Then $f_n$ converges to $f$ pointwise
if

$$f_n(x) \to f(x) \quad \text{a.e. } \mu. \tag{A.3}$$

A subset $\mathcal{F}_0$ of $\mathcal{F}$ is dense in $\mathcal{F}$ if, given any $f \in \mathcal{F}$, there exists a sequence in $\mathcal{F}_0$
having $f$ as its limit point. A space $\mathcal{F}$ is separable if there exists a countable dense
subset of $\mathcal{F}$. A space $\mathcal{F}$ such that every sequence has a convergent subsequence
whose limit point is in $\mathcal{F}$ is compact.² A space $\mathcal{F}$ is a metric space if for every
pair of points $f, g$ in $\mathcal{F}$ there is defined a metric (or distance) $d(f, g) \ge 0$ such
that

(i) $d(f, g) = 0$ if and only if $f = g$;

(ii) $d(f, g) = d(g, f)$;

(iii) $d(f, g) + d(g, h) \ge d(f, h)$ for all $f, g, h$.

The space is a pseudometric space if (i) is replaced by

(i') $d(f, f) = 0$ for all $f \in \mathcal{F}$.

A pseudometric space can be converted into a metric space by introducing the
equivalence relation $f \sim g$ if $d(f, g) = 0$. The equivalence classes $F, G, \ldots$ then
constitute a metric space with respect to the metric $D(F, G) = d(f, g)$ where
$f \in F$, $g \in G$.

In any pseudometric space a natural convergence definition is obtained by
putting $f_n \to f$ if $d(f_n, f) \to 0$.

Example A.2.2 The space of integrable functions of Example A.2.1(i) becomes
a pseudometric space if we put

$$d(f, g) = \int |f - g| \, d\mu$$

and the induced convergence definition is that given by (A.1).


¹ Here and in the examples that follow, the limit $f$ is not unique. More specifically,
if $f_n \to f$, then $f_n \to g$ if and only if $f = g$ (a.e. $\mu$). Putting $f \sim g$ when $f = g$
(a.e. $\mu$), uniqueness can be obtained by working with the resulting equivalence classes
of functions rather than with the functions themselves.

² The term compactness is more commonly used for an alternative concept, which
coincides with the one given here in metric spaces. The distinguishing term sequential
compactness is then sometimes given to the notion defined here.

Example A.2.3 Let $\mathcal{P}$ be a family of probability distributions over $(\mathcal{X}, \mathcal{A})$. Then
$\mathcal{P}$ is a metric space with respect to the metric

$$d(P, Q) = \sup_{A \in \mathcal{A}} |P(A) - Q(A)|. \tag{A.4}$$
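
For distributions on a finite sample space the metric (A.4) can be evaluated by brute force over all events. The sketch below does so and compares the result with half the $L_1$ distance between the probability vectors, a standard identity that is not stated in the text; the function names and the two distributions are assumptions made for the illustration.

```python
from itertools import chain, combinations

def sup_distance(p, q):
    """Supremum over all events A of |P(A) - Q(A)| on a finite sample space."""
    points = range(len(p))
    events = chain.from_iterable(combinations(points, r)
                                 for r in range(len(p) + 1))
    return max(abs(sum(p[i] - q[i] for i in a)) for a in events)

def half_l1(p, q):
    """Half the L1 distance between the two probability vectors."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Two hypothetical distributions on four points.
p = [0.5, 0.3, 0.2, 0.0]
q = [0.25, 0.25, 0.25, 0.25]
d = sup_distance(p, q)
```

The supremum is attained at the event on which $p$ exceeds $q$, which is why the two quantities agree.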

Lemma A.2.1 If $\mathcal{F}$ is a separable pseudometric space, then every subset of $\mathcal{F}$
is also separable.

Proof. By assumption there exists a dense countable subset $\{f_n\}$ of $\mathcal{F}$. Let

$$S_{m,n} = \{ f : d(f, f_n) < 1/m \},$$

and let $A$ be any subset of $\mathcal{F}$. Select one element from each of the intersections
$A \cap S_{m,n}$ that is nonempty, and denote this countable collection of elements by
$A_0$. If $a$ is any element of $A$ and $m$ any positive integer, there exists an element
$f_{n_m}$ such that $d(a, f_{n_m}) < 1/m$. Therefore $a$ belongs to $S_{m,n_m}$, the intersection
$A \cap S_{m,n_m}$ is nonempty, and there exists therefore an element of $A_0$ whose distance
to $a$ is $< 2/m$. This shows that $A_0$ is dense in $A$, and hence that $A$ is separable. ■


Lemma A.2.2 A sequence $f_n$ of integrable functions converges to $f$ in the mean
if and only if

$$\int_A f_n \, d\mu \to \int_A f \, d\mu \quad \text{uniformly for } A \in \mathcal{A}. \tag{A.5}$$

Proof. That (A.1) implies (A.5) is obvious, since for all $A \in \mathcal{A}$

$$\left| \int_A f_n \, d\mu - \int_A f \, d\mu \right| \le \int |f_n - f| \, d\mu.$$

Conversely, suppose that (A.5) holds, and denote by $A_n$ and $A'_n$ the set of points
$x$ for which $f_n(x) > f(x)$ and $f_n(x) < f(x)$ respectively. Then

$$\int |f_n - f| \, d\mu = \int_{A_n} (f_n - f) \, d\mu - \int_{A'_n} (f_n - f) \, d\mu \to 0. \qquad ■$$


Lemma A.2.3 A sequence $f_n$ of uniformly bounded functions converges to a
bounded function $f$ weakly if and only if

$$\int_A f_n \, d\mu \to \int_A f \, d\mu \quad \text{for all } A \text{ with } \mu(A) < \infty. \tag{A.6}$$


Proof. That weak convergence implies (A.6) is seen by taking for $p$ in (A.2) the
indicator function of a set $A$, which is integrable if $\mu(A) < \infty$. Conversely (A.6)
implies that (A.2) holds if $p$ is any simple function $s = \sum a_i I_{A_i}$ with all the $\mu(A_i) < \infty$.
Given any integrable function $p$, there exists, by the definition of the integral,
such a simple function $s$ for which $\int |p - s| \, d\mu < \epsilon / 3M$, where $M$ is a bound on
the $|f|$'s. We then have

$$\left| \int (f_n - f) p \, d\mu \right| \le \left| \int f_n (p - s) \, d\mu \right| + \left| \int f (s - p) \, d\mu \right| + \left| \int (f_n - f) s \, d\mu \right|.$$

The first two terms on the right-hand side are $< \epsilon/3$, and the third term tends to
zero as $n$ tends to infinity. Thus the left-hand side is $< \epsilon$ for $n$ sufficiently large,
as was to be proved. ■


Lemma A.2.4³ Let $f_n$, $n = 1, 2, \ldots$, and $f$ be nonnegative integrable
functions with

$$\int f_n \, d\mu = \int f \, d\mu = 1.$$

Then pointwise convergence of $f_n$ to $f$ implies that $f_n \to f$ in the mean.


Proof. If $g_n = f_n - f$, then $g_n \ge -f$, and the negative part $g_n^- = \max(-g_n, 0)$
satisfies $|g_n^-| \le f$. Since $g_n(x) \to 0$ (a.e. $\mu$), it follows from Theorem 2.2.2(ii) of
Chapter 2 that $\int g_n^- \, d\mu \to 0$, and $\int g_n^+ \, d\mu$ then also tends to zero, since $\int g_n \, d\mu = 0$.
Therefore $\int |g_n| \, d\mu = \int (g_n^+ + g_n^-) \, d\mu \to 0$, as was to be proved. ■

Let $P$ and $P_n$, $n = 1, 2, \ldots$ be probability distributions over $(\mathcal{X}, \mathcal{A})$ with
densities $p$ and $p_n$ with respect to $\mu$. Consider the convergence definitions

(a) $p_n \to p$ (a.e. $\mu$);

(b) $\int |p_n - p| \, d\mu \to 0$;

(c) $\int g p_n \, d\mu \to \int g p \, d\mu$ for all bounded measurable $g$;

and

(b') $P_n(A) \to P(A)$ uniformly for all $A \in \mathcal{A}$;

(c') $P_n(A) \to P(A)$ for all $A \in \mathcal{A}$.

Then Lemmas A.2.2 and A.2.4 together with a slight modification of Lemma
A.2.3 show that (a) implies (b) and (b) implies (c), and that (b) is equivalent to
(b') and (c) to (c'). It can further be shown that neither (a) and (b) nor (b) and
(c) are equivalent.⁴
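
The implication from (a) to (b) can be illustrated numerically. The sketch below assumes, purely for illustration, that $\mu$ is counting measure on the nonnegative integers, that $p$ is the Poisson density with mean 2, and that $p_n$ is the binomial density with $n$ trials and success probability $2/n$, so that $p_n \to p$ pointwise; none of these choices comes from the text. Sums are truncated at `kmax`, beyond which both densities are negligible.

```python
import math

def poisson_pmf(lam, kmax):
    """Poisson probabilities for k = 0, ..., kmax, computed recursively."""
    probs = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        probs.append(probs[-1] * lam / k)
    return probs

def binomial_pmf(n, p, kmax):
    """Binomial(n, p) probabilities for k = 0, ..., kmax (zero beyond k = n)."""
    probs = [(1.0 - p) ** n]
    for k in range(1, kmax + 1):
        probs.append(probs[-1] * (n - k + 1) / k * p / (1.0 - p))
    return probs

def l1_distance(n, lam=2.0, kmax=100):
    """Truncated version of the integral in (b) with respect to counting measure."""
    b = binomial_pmf(n, lam / n, kmax)
    q = poisson_pmf(lam, kmax)
    return sum(abs(bk - qk) for bk, qk in zip(b, q))

distances = [l1_distance(n) for n in (10, 100, 1000)]
```

The computed distances decrease toward zero as $n$ grows, as Lemma A.2.4 requires for densities that converge pointwise.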


A.3 Banach and Hilbert Spaces 

A set $V$ is called a vector space (or linear space) over the reals if there exists a
function $+$ on $V \times V$ to $V$ and a function $\cdot$ on $\mathbb{R} \times V$ to $V$ which satisfy for
$x, y, z \in V$,

(i) $x + y = y + x$.

(ii) $(x + y) + z = x + (y + z)$.

(iii) There is a vector $0 \in V$: $x + 0 = x$ for all $x \in V$.

(iv) $\lambda(x + y) = \lambda x + \lambda y$ for any $\lambda \in \mathbb{R}$.

(v) $(\lambda_1 + \lambda_2)x = \lambda_1 x + \lambda_2 x$ for $\lambda_i \in \mathbb{R}$.

(vi) $\lambda_1(\lambda_2 x) = (\lambda_1 \lambda_2)x$ for $\lambda_i \in \mathbb{R}$.

(vii) $0 \cdot x = 0$, $1 \cdot x = x$.

³ Scheffé (1947).

⁴ Robbins (1948).

The operation $+$ is called addition and $\cdot$ is multiplication by scalars.
A nonnegative real-valued function $\| \cdot \|$ defined on a vector space is called a norm
if

(i) $\|x\| = 0$ if and only if $x = 0$.

(ii) $\|x + y\| \le \|x\| + \|y\|$.

(iii) $\|\lambda x\| = |\lambda| \, \|x\|$.

A vector space with norm $\| \cdot \|$ is then a metric space if we define the metric
$d$ to be $d(x, y) = \|x - y\|$.

A sequence $\{x_n\}$ of elements in a normed vector space $V$ is called a Cauchy
sequence if, given $\epsilon > 0$, there is an $N$ such that for all $m, n \ge N$, we have
$\|x_n - x_m\| < \epsilon$. A Banach space is a normed vector space that is complete in the
sense that every Cauchy sequence $\{x_n\}$ satisfies $\|x_n - x\| \to 0$ for some $x \in V$.


Example A.3.1 ($L^p$ spaces.) Let $\mu$ be a measure over a measurable space
$(\mathcal{X}, \mathcal{A})$. Fix $p > 0$ and let $L^p[\mathcal{X}, \mu]$ denote the measurable functions $f$ such that
$\int |f|^p \, d\mu < \infty$. If we identify equivalence classes of functions that are equal almost
everywhere $\mu$, then, for $p \ge 1$, this vector space becomes a normed vector space
by defining

$$\|f\|_p = \left[ \int |f|^p \, d\mu \right]^{1/p}.$$

In this case, the triangle inequality

$$\|f + g\|_p \le \|f\|_p + \|g\|_p$$

is known as Minkowski’s inequality. Moreover, this space is a Banach space.⁵


A Hilbert space $H$ is a Banach space for which there is defined a function $\langle x, y \rangle$
on $H \times H$ to $\mathbb{R}$, called the inner product of $x$ and $y$, satisfying, for $x_i, y \in H$,
$\lambda_i \in \mathbb{R}$,

(i) $\langle \lambda_1 x_1 + \lambda_2 x_2, y \rangle = \lambda_1 \langle x_1, y \rangle + \lambda_2 \langle x_2, y \rangle$.

(ii) $\langle x, y \rangle = \langle y, x \rangle$.

(iii) $\langle x, x \rangle = \|x\|^2$.

Two vectors $x$ and $y$ of $H$ are called orthogonal if $\langle x, y \rangle = 0$. A collection
$H_0 \subset H$ of vectors is called an orthogonal system if any two elements in $H_0$
are orthogonal. An orthogonal system is orthonormal if each vector in it has
norm 1. An orthonormal system $H_0$ is called complete if $\langle x, h \rangle = 0$ for all $h \in H_0$
implies $x = 0$. In a separable Hilbert space, every orthonormal system is countable
and there exists a complete orthonormal system. Letting $\{h_1, h_2, \ldots\}$ denote a
complete orthonormal system, Parseval’s identity says that, for any $x \in H$,

$$\|x\|^2 = \sum_{j=1}^{\infty} \langle x, h_j \rangle^2. \tag{A.7}$$


Example A.3.2 ($L^2$ spaces.) In Example A.3.1 with $p = 2$, the equivalence
classes of square integrable functions form a Hilbert space with inner product given
by

$$\langle f_1, f_2 \rangle = \int f_1 f_2 \, d\mu.$$

If $\mathcal{X}$ is $[0, 1]$ and $\mu$ is Lebesgue measure, then a complete orthonormal system
is given by the functions $f_j(u) = \sqrt{2} \sin(\pi j u)$, $j = 1, 2, \ldots$. Therefore, for any
square integrable function $f$, Parseval’s identity yields

$$\int_0^1 f^2(u) \, du = 2 \sum_{j=1}^{\infty} \left[ \int_0^1 f(u) \sin(\pi j u) \, du \right]^2.$$

⁵ For proofs of the results in this section, see Chapter 5 of Dudley (1989).
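
Parseval's identity for the sine system can be verified numerically for a particular function. The sketch below uses the hypothetical choice $f(u) = u(1 - u)$, a midpoint rule for the integrals, and a truncation of the series after 39 terms; these choices are assumptions made only for the illustration.

```python
import math

def integrate(g, n=2000):
    """Midpoint rule approximation to the integral of g over [0, 1]."""
    h = 1.0 / n
    return h * sum(g((i + 0.5) * h) for i in range(n))

def f(u):
    # Hypothetical square integrable function used for the check.
    return u * (1.0 - u)

# Left-hand side: integral of f^2 over [0, 1] (exactly 1/30 for this f).
lhs = integrate(lambda u: f(u) ** 2)

# Right-hand side: twice the sum of squared sine coefficients, truncated at j = 39.
rhs = 2.0 * sum(
    integrate(lambda u, j=j: f(u) * math.sin(math.pi * j * u)) ** 2
    for j in range(1, 40)
)
```

For this $f$ the coefficients decay like $j^{-3}$, so the truncated series already agrees with the left-hand side to high accuracy.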


A.4 Dominated Families of Distributions 

Let $\mathcal{M}$ be a family of measures defined over a measurable space $(\mathcal{X}, \mathcal{A})$. Then
$\mathcal{M}$ is said to be dominated by a $\sigma$-finite measure $\mu$ defined over $(\mathcal{X}, \mathcal{A})$ if each
member of $\mathcal{M}$ is absolutely continuous with respect to $\mu$. The family $\mathcal{M}$ is said
to be dominated if there exists a $\sigma$-finite measure dominating it. Actually, if $\mathcal{M}$
is dominated there always exists a finite dominating measure. For suppose that
$\mathcal{M}$ is dominated by $\mu$ and that $\mathcal{X} = \bigcup A_i$, with $\mu(A_i)$ finite for all $i$. If the sets
$A_i$ are taken to be mutually exclusive, the measure $\nu(A) = \sum_i \mu(A \cap A_i) / [2^i \mu(A_i)]$
also dominates $\mathcal{M}$ and is finite.
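
The two properties claimed for $\nu$ can be checked in one line each. The following sketch assumes that $\mu(A_i) > 0$ for every $i$ and that the index $i$ runs over $1, 2, \ldots$ (sets of measure zero can be merged into another $A_i$); these assumptions are not spelled out in the text.

```latex
\nu(\mathcal{X}) = \sum_{i} \frac{\mu(\mathcal{X} \cap A_i)}{2^{i}\,\mu(A_i)}
                 = \sum_{i} 2^{-i} \le 1,
\qquad
\nu(A) = 0 \;\Longrightarrow\; \mu(A \cap A_i) = 0 \text{ for all } i
       \;\Longrightarrow\; \mu(A) = \sum_{i} \mu(A \cap A_i) = 0 .
```

Thus $\nu$ is finite, and every set of $\nu$-measure zero has $\mu$-measure zero and hence measure zero under each member of $\mathcal{M}$.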

Theorem A.4.1⁶ A family $\mathcal{P}$ of probability measures over a Euclidean space
$(\mathcal{X}, \mathcal{A})$ is dominated if and only if it is separable with respect to the metric (A.4) or
equivalently with respect to the convergence definition

$$P_n \to P \quad \text{if } P_n(A) \to P(A) \text{ uniformly for } A \in \mathcal{A}.$$

Proof. Suppose first that $\mathcal{P}$ is separable and that the sequence $\{P_n\}$ is dense
in $\mathcal{P}$, and let $\mu = \sum P_n / 2^n$. Then $\mu(A) = 0$ implies $P_n(A) = 0$ for all $n$, and
hence $P(A) = 0$ for all $P \in \mathcal{P}$. Conversely suppose that $\mathcal{P}$ is dominated by a
measure $\mu$, which without loss of generality can be assumed to be finite. Then we
must show that the set of integrable functions $dP/d\mu$ is separable with respect
to the convergence definition (A.5) or, because of Lemma A.2.2, with respect to
convergence in the mean. It follows from Lemma A.2.1 that it suffices to prove
this separability for the class $\mathcal{F}$ of all functions $f$ that are integrable $\mu$. Since by
the definition of the integral every integrable function can be approximated in
the mean by simple functions, it is enough to prove this for the case that $\mathcal{F}$ is the
class of all simple integrable functions. Any simple function can be approximated
in the mean by simple functions taking on only rational values, so that it is
sufficient to prove separability of the class of functions $\sum r_i I_{A_i}$ where the $r$'s
are rational and the $A$'s are Borel sets, with finite $\mu$-measure since the $f$'s are
integrable. It is therefore finally enough to take for $\mathcal{F}$ the class of functions $I_A$,
which are indicator functions of Borel sets with finite measure. However, any such
set can be approximated by finite unions of disjoint rectangles with rational end
points. The class of all such unions is denumerable, and the associated indicator
functions will therefore serve as the required countable dense subset of $\mathcal{F}$. ■

⁶ Berger (1951b).

An examination of the proof shows that the Euclidean nature of the space
$(\mathcal{X}, \mathcal{A})$ was used only to establish the existence of a countable number of sets
$A_i \in \mathcal{A}$ such that for any $A \in \mathcal{A}$ with finite measure there exists a subsequence $A_{i_j}$
with $\mu(A_{i_j}) \to \mu(A)$. This property holds quite generally for any $\sigma$-field $\mathcal{A}$ which
has a countable number of generators, that is, for which there exists a countable
number of sets $B_i$ such that $\mathcal{A}$ is the smallest $\sigma$-field containing the $B_i$.⁷ It
follows that Theorem A.4.1 holds for any $\sigma$-field with this property. Statistical
applications of such $\sigma$-fields occur in sequential analysis, where the sample space
$\mathcal{X}$ is the union $\mathcal{X} = \bigcup_i \mathcal{X}_i$ of Borel subsets $\mathcal{X}_i$ of $i$-dimensional Euclidean space. In
these problems, $\mathcal{X}_i$ is the set of points $(x_1, \ldots, x_i)$ for which exactly $i$ observations
are taken. If $\mathcal{A}_i$ is the $\sigma$-field of Borel subsets of $\mathcal{X}_i$, one can take for $\mathcal{A}$ the
$\sigma$-field generated by the $\mathcal{A}_i$, and since each $\mathcal{A}_i$ possesses a countable number of
generators, so does $\mathcal{A}$.

If $\mathcal{A}$ does not possess a countable number of generators, a somewhat weaker
conclusion can be asserted. Two families of measures $\mathcal{M}$ and $\mathcal{N}$ are equivalent if
$\mu(A) = 0$ for all $\mu \in \mathcal{M}$ implies $\nu(A) = 0$ for all $\nu \in \mathcal{N}$ and vice versa.


Theorem A.4.2⁸ A family $\mathcal{P}$ of probability measures is dominated by a $\sigma$-finite
measure if and only if $\mathcal{P}$ has a countable equivalent subset.


Proof. Suppose first that $\mathcal{P}$ has a countable equivalent subset $\{P_1, P_2, \ldots\}$. Then
$\mathcal{P}$ is dominated by $\mu = \sum P_n / 2^n$. Conversely, let $\mathcal{P}$ be dominated by a $\sigma$-finite
measure $\mu$, which without loss of generality can be assumed to be finite. Let $\mathcal{Q}$
be the class of all probability measures $Q$ of the form $\sum c_i P_i$, where $P_i \in \mathcal{P}$, the
$c$'s are positive, and $\sum c_i = 1$. The class $\mathcal{Q}$ is also dominated by $\mu$, and we denote
by $q$ a fixed version of the density $dQ/d\mu$. We shall prove the fact, equivalent to
the theorem, that there exists $Q_0$ in $\mathcal{Q}$ such that $Q_0(A) = 0$ implies $Q(A) = 0$
for all $Q \in \mathcal{Q}$.

Consider the class $\mathcal{C}$ of sets $C$ in $\mathcal{A}$ for which there exists $Q \in \mathcal{Q}$ such that
$q(x) > 0$ a.e. $\mu$ on $C$ and $Q(C) > 0$. Let $\mu(C_i)$ tend to $\sup_{\mathcal{C}} \mu(C)$, let $q_i(x) > 0$
a.e. on $C_i$, and denote the union of the $C_i$ by $C_0$. Then $q_0(x) = \sum c_i q_i(x)$ agrees
a.e. with the density of $Q_0 = \sum c_i Q_i$ and is positive a.e. on $C_0$, so that $C_0 \in \mathcal{C}$.
Suppose now that $Q_0(A) = 0$, let $Q$ be any other member of $\mathcal{Q}$, and let
$C = \{x : q(x) > 0\}$. Then $Q_0(A \cap C_0) = 0$, and therefore $\mu(A \cap C_0) = 0$ and
$Q(A \cap C_0) = 0$. Also $Q(A \cap C_0^c \cap C^c) = 0$. Finally, $Q(A \cap C_0^c \cap C) > 0$ would lead
to $\mu(C_0 \cup [A \cap C_0^c \cap C]) > \mu(C_0)$ and hence to a contradiction of the relation
$\mu(C_0) = \sup_{\mathcal{C}} \mu(C)$, since $A \cap C_0^c \cap C$ and therefore $C_0 \cup [A \cap C_0^c \cap C]$ belongs
to $\mathcal{C}$. ■


⁷ A proof of this is given for example by Halmos (1974, Theorem B of Section 40).
⁸ Halmos and Savage (1949).


A.5 The Weak Compactness Theorem

The following theorem forms the basis for proving the existence of most powerful 
tests, most stringent tests, and so on. 


Theorem A.5.1⁹ (Weak compactness theorem). Let $\mu$ be a $\sigma$-finite measure
over a Euclidean space, or more generally over any measurable space $(\mathcal{X}, \mathcal{A})$
for which $\mathcal{A}$ has a countable number of generators. Then the set of measurable
functions $\phi$ with $0 \le \phi \le 1$ is compact with respect to the weak convergence (A.2).


Proof. Given any sequence $\{\phi_n\}$, we must prove the existence of a subsequence
$\{\phi_{n_j}\}$ and a function $\phi$ such that

$$\lim \int \phi_{n_j} p \, d\mu = \int \phi p \, d\mu$$

for all integrable $p$. If $\mu^*$ is a finite measure equivalent to $\mu$, then $p^*$ is integrable
$\mu^*$ if and only if $p = (d\mu^*/d\mu) p^*$ is integrable $\mu$, and $\int \phi p \, d\mu = \int \phi p^* \, d\mu^*$ for all
$\phi$. We may therefore assume without loss of generality that $\mu$ is finite. Let $\{p_n\}$
be a sequence of $p$'s which is dense in the $p$'s with respect to convergence in the
mean. The existence of such a sequence is guaranteed by Theorem A.4.1 and the
remark following it. If

$$\Phi_n(p) = \int \phi_n p \, d\mu,$$

the sequence $\Phi_n(p)$ is bounded for each $p$. A subsequence $\Phi_{n_k}$ can be extracted
such that $\Phi_{n_k}(p_m)$ converges for each $p_m$ by the following diagonal process.
Consider first the sequence of numbers $\{\Phi_n(p_1)\}$ which possesses a convergent
subsequence $\Phi_{n'_1}(p_1), \Phi_{n'_2}(p_1), \ldots$. Next the sequence $\Phi_{n'_1}(p_2), \Phi_{n'_2}(p_2), \ldots$ has
a convergent subsequence $\Phi_{n''_1}(p_2), \Phi_{n''_2}(p_2), \ldots$. Continuing in this way, let
$n_1 = n'_1$, $n_2 = n''_2$, $n_3 = n'''_3$, $\ldots$. Then $n_1 < n_2 < \cdots$, and the sequence $\{\Phi_{n_i}\}$
converges for each $p_m$. It follows from the inequality

$$\left| \int (\phi_{n_j} - \phi_{n_i}) p \, d\mu \right| \le \left| \int (\phi_{n_j} - \phi_{n_i}) p_m \, d\mu \right| + 2 \int |p - p_m| \, d\mu$$

that $\Phi_{n_i}(p)$ converges for all $p$. Denote its limit by $\Phi(p)$, and define a set function
$\Phi^*$ over $\mathcal{A}$ by putting

$$\Phi^*(A) = \Phi(I_A).$$


Then $\Phi^*$ is nonnegative and bounded, since for all $A$, $\Phi^*(A) \le \mu(A)$. To see
that it is also countably additive let $A = \bigcup A_k$, where the $A_k$ are disjoint. Then
$\Phi^*(\bigcup A_k) = \lim \Phi_{n_i}(I_{\bigcup A_k})$ and

$$\left| \int_{\bigcup A_k} \phi_{n_i} \, d\mu - \sum \Phi^*(A_k) \right| \le \left| \int_{\bigcup_{k=1}^{m} A_k} \phi_{n_i} \, d\mu - \sum_{k=1}^{m} \Phi^*(A_k) \right| + \left| \int_{\bigcup_{k=m+1}^{\infty} A_k} \phi_{n_i} \, d\mu - \sum_{k=m+1}^{\infty} \Phi^*(A_k) \right|.$$

Here the second term is to be taken as zero in the case of a finite sum $A = \bigcup_{k=1}^{m} A_k$,
and otherwise does not exceed $2\mu(\bigcup_{k=m+1}^{\infty} A_k)$, which can be made arbitrarily
small by taking $m$ sufficiently large. For any fixed $m$ the first term tends to zero
as $i$ tends to infinity. Thus $\Phi^*$ is a finite measure over $(\mathcal{X}, \mathcal{A})$. It is furthermore
absolutely continuous with respect to $\mu$, since $\mu(A) = 0$ implies $\Phi_{n_i}(I_A) = 0$ for
all $i$, and therefore $\Phi(I_A) = \Phi^*(A) = 0$. We can now apply the Radon-Nikodym
theorem to get

$$\Phi^*(A) = \int_A \phi \, d\mu \quad \text{for all } A,$$

with $0 \le \phi \le 1$. We then have

$$\int_A \phi_{n_i} \, d\mu \to \int_A \phi \, d\mu \quad \text{for all } A,$$

and weak convergence of the $\phi_{n_i}$ to $\phi$ follows from Lemma A.2.3. ■

⁹ Banach (1932). The theorem is valid even without the assumption of a countable
number of generators; see Nölle and Plachky (1967) and Alaoglu’s theorem, given for
example in Royden (1988).



Equivariance, 13, 396. See also
Invariance 

Equivariant confidence bands, 255, 
376, 384, 390 

Equivariant confidence bounds, 272 
Equivariant confidence sets, 248, 

251, 252, 272, 273, 276; and 
pivotal quantities, 274. See 
also Uniformly most accurate 
confidence sets 

Error control: strong, 350; weak, 350 
Error of first and second kind, 57, 66; 
of type 3, 139; familywise error 
rate, 349; directional, 373 
Essentially complete class of decision 
procedures, 17, 54, 69, 96. See 
also Completeness of a class of 
decision procedures 
Estimation, see Confidence bands; 

Confidence bounds; Confidence
intervals; Confidence sets; 
Equivariance; Maximum 
likelihood; Median: Point 
estimation; Unbiasedness 
Euclidean sample space, 41 
Exchangeable, 355 

Expectation (of a random variable),
33, 39; conditional, 37, 39, 42 
Expected order statistics, 243 
Experimental design, see Design of 
experiments 

Exponential distribution, 22, 68, 

74; confidence bounds and 
intervals in, 74; order statistics 
from, 54; relation to Pareto 
distribution, 94; relation 
to Poisson process, 54, 


68; sufficient statistics for, 

27; testing against gamma 
distribution, 200; testing 
against normal or uniform 
distribution, 260; tests for 
parameters of, 93, 195; two- 
sample problem for, 259. See 
also Chi-squared distribution; 
Gamma distribution; Life 
testing 

Exponential family, 46, 55; 

admissibility of tests in, 

234; completeness of, 117; 
differentiability of, 49; 
equivalent forms of, 123; 
expansion of loglikelihood, 

483, 484; median unbiased 
estimators in, 162; moments 
of sufficient statistics, 55; 
monotone likelihood ratio of, 
67; natural parameter space of, 
48, 55, 119; q.m.d. property, 
488; regression models for, 

210; testing in multiparameter, 
119, 121, 123, 234; total 
positivity of, 104. See also 
One-parameter exponential 
family 

Exponential waiting times, 22, 54, 

74. See also Exponential 
distribution 

Extreme order statistic, 678, 679 

Factorization criterion for sufficient 
statistics, 19, 45, 46 
False discovery rate, 354, 386 
Family of hypotheses, 349, 374 
Familywise error rate (FWER), 349, 
354, 355, 372, 386; control 
based on bootstrap, 658-661 
Fatou’s Lemma, 32 
F-distribution, 158; for simultaneous
confidence intervals, 381; in
Hotelling’s $T^2$-test, 306; in
tests and confidence intervals
for ratio of variances, 166, 299;
noncentral, 307; relation to
beta distribution, 159. See also
F-test for linear hypothesis;
F-test for ratio of variances
Fiducial, 108; distribution, 175;
probability, 108, 175
Fieller’s problem, 197
Finite decision problem, 54
First-order accurate, 666 
Fisher’s exact test, 127, 149. See also 
Two by two tables 
Fisher Information, 485, 486 
Fisher’s least significant difference 
method, 368 

Fisher linkage model, 598 
Fisher’s z-transformation, 439
Fixed effects model, 297. See also 
Linear model; Model I and II 
Free Group, 25 
Frequentist point of view, 175 
Friedman’s rank test, 290 
F-test for linear hypothesis, 280; 
admissibility of, 281; as 
Bayes test, 309; for nested 
classification, 302; has best 
average power, 308; in Fisher’s 
least significant difference 
method, 368; in Gabriel’s 
simultaneous test procedure, 
368; in mixed models, 426; in 
model II analysis of variance, 
299; power of, 281; robustness 
of, 445, 446, 448, 480, 491. See
also F-distribution 

F-test for ratio of variances, 106, 107, 
220, 238; admissibility of, 239; 
nonrobustness of, 446. See 
also F-distribution; Normal 
two-sample problem, ratio of 
variances 

F-test in multiple comparison 
procedures, 366 
Fubini’s theorem, 34 
Fully informative statistics, 96 
Functionals, 571 

Fundamental lemma, see Neyman- 
Pearson fundamental 
lemma 

Gabriel’s simultaneous test procedure, 
368 

Gamma distribution $\Gamma(g, b)$, 99, 196;
relation to Beta distribution, 
196; scale parameter of, 201; 
shape parameter of, 196. 

See also Beta distribution; 
Chi-squared distribution; 
Exponential distribution 
Gaussian curvature, 341 
Generalized linear models, 318 


Ghosh-Pratt identity, 200 
Glivenko-Cantelli theorem, 441 
Goodness of fit test, vii, 256 583; 
bootstrap tests of, 673; 
in multinomial models, 
514-516; See also Chi-squared 
tests; Kolmogorov-Smirnov; 
Neyman’s smooth tests; 
Separate families; Weighted 
quadratic tests 

Group: amenable, 334; free, 25; 
generated by subgroups, 

217; linear, 216, 227, 334; of 
monotone transformations, 
215; orthogonal, 215, 217, 

330; permutation, 215; scale,
215; transformation, 212, 

213; transitive, 215, 220; 
translation, 215, 219, 333. See 
also Equivariance; Invariance 
Group, 692, 693; family, 395, 401 
Guaranteed power: achieved through 
sequential procedure, 124, 126, 
198, 199 

Haar measure, 227, 331 
Hazard ordering, 101 
Hellinger distance, 530-534, 582
Hierarchical classification, see Nested 
classification 

Higher order asymptotics, 661-668 
Highest probability density (HPD) 
credible region, 173, 175, 202 
Hilbert space, 696-698 
Hodges-Lehmann efficiency, 539 
Hodges’ superefficient estimator, 525 
Holm procedure for multiple testing, 
350, 351, 363, 385 

Homogeneity of means: tests of, 285; 
against ordered alternatives, 
287; multiple comparisons for, 
364, 366; for normal means, 
285, 287; nonparametric, 286, 
290, 458. See also Multiple 
comparisons 
Homomorphism, 12 
Hotelling’s $T^2$-test, 306; admissibility
of, 317; as Bayes solution, 317; 
minimaxity of, 335 
HPD region, see Highest probability 
density 

Huber condition, 455
Hunt-Stein theorem, 331 





Hypergeometric distribution, 66, 

134; in testing equality of two 
binomials, 127; in testing for 
independence in a 2 x 2 table, 
131; relation to distribution 
of runs, 146. See also Fisher’s 
exact test; Two by two tables 
Hypergeometric function, 209 
Hypothesis testing, 5, 56; history of, 
107; loss functions for, 59, 69, 
222; without stochastic basis, 
131, 132 

Improper prior distribution, 172 
Inadmissibility, 17; of confidence 

sets for vector means, 335; of 
likelihood ratio test, 263; of 
UMP invariant test, 306. See 
also Admissibility 
Independence: conditional, 133; of 
sample mean from function 
of differences in normal 
samples, 152; of statistic from 
a complete sufficient statistic, 
152; of sum and ratio of 
independent variables, 153; 
of two random variables, 34 
Independence, test for: in bivariate 
normal distribution, 191; in 
nonparametric models, 241, 
271; in r x c contingency 
tables, 127; in two by two 
tables, 127-130 
Indicator function of a set, 33 
Indifference zone, 320 
Inference, statistical, see Statistical 
inference 

Information matrix, 485, 486 
Integrable function, 31 
Integration, 31 

Interaction, 291, 292, 311; as main 
effects, 311; in random effects 
and mixed models, 313, 314; 
test for absence of, 291 
Interval estimation, see Confidence 
intervals 

Into, see Transformation 
Intraclass correlation coefficient, 313 
Invariance: of decision procedure, 

12, 13; of likelihood ratio, 

341; of measure, 299, 518, 

519; and admissibility, 26; 
and ancillarity, 395, 397, 401; 


and symmetry, 212; history 
of, 276; of likelihood ratio, 

262; of measure, 227; of power 
functions, 227-229; of tests, 
214, 276; principle of, 214; 
relation to equivariance, 13; 
relation to minimax principle, 
25, 329; relation to sufficiency, 
220; relation to unbiasedness, 
23, 229, 230; warning against 
inappropriate use of, 286. 

See also Almost invariance; 
Equivariance 

Invariant measure, 227, 230; over 
orthogonal group, 330; over 
translation group, 333 
Inverse Gaussian distribution, 100, 
197 

Inverse sampling: for binomial trials, 
67; for Poisson variables, 68, 
98. See also Negative binomial 
distribution; Poisson process; 
Waiting times 

Jackknife, 648, 674 
Joint confidence rectangles, 657. See
also Simultaneous confidence 
sets 

Kendall’s statistic, 272 
k-FWER, 374, 386
Kolmogorov-Smirnov: and bootstrap 
confidence bands, 658; 
asymptotic behavior of, 441, 
442, 584-589; based on a pivot, 
645; extensions of, 589-590; 
statistic, 256; test for goodness 
of fit, 256. See also Goodness 
of fit 

Kolmogorov-Smirnov distance, 441 
Kruskal-Wallis test, 286 
Kullback-Leibler information (or 

divergence), 432; backward, 672
Kurtosis, 459 

Large-sample theory, vii, 417 
Latin squares design, 293, 312 
Lattice distribution, 459 
Laws of large numbers: Weak, 431; 

Strong, 441; Uniform, 463, 464 
Least favorable distribution, 18, 84, 
85, 86, 321, 361 
Least squares estimates, 281
Lebesgue convergence theorems, 39
Lebesgue integral, 31 
Lebesgue measure, 29 
Legendre polynomials, 599, 600 
Level of significance, see Significance 
level 

Levy distance, 430 

Life testing, 54. See also Exponential 
distribution; Poisson process 
Likelihood, 16; function, 503, 504. See
also Maximum likelihood 
Likelihood ratio, 15, 101, 494; 
censored, 326; invariance 
of, 262; large-sample theory 
of, 494, 503; monotone, 65; 
preference order based on, 60, 
66; sufficiency of, 53. See also 
Monotone likelihood ratio 
Likelihood ratio test, 16; example of 
inadmissible, 263; large-sample 
theory of, 513-517; using 
bootstrap critical values, 670, 
671 

Lindley’s Paradox, 95 
Linear functionals, 571; LAUMP
property, 572-574 
Linear hypothesis, 277, 333; 

admissibility of test for, 281;
Bayes test for, 309; canonical
form for, 278, 317; F-test
for, 200; inhomogeneous, 

283; more efficient tests for, 
287; parametric form of, 284, 
309; power of test for, 280; 
properties of test for, 280, 

308, 333, 338, 341; reduction 
of, through invariance, 279; 
robustness of tests for, 
451-458. See also Analysis 
of variance; Additive linear 
model, Generalized linear 
model 

Linear model, 277, 318; confidence 
intervals in, 309; history of, 
317; simultaneous confidence 
intervals in, 380 

Locally asymptotically uniformly
most powerful (LAUMP): for
equivalence hypotheses, 559-
564; for one-sided hypotheses
in multiparameter models,
553-559; in nonparametric
models, 572; in univariate
models, 544-549

Locally most powerful rank test, 244, 
275 

Locally optimal tests, 322, 339, 340, 
403, 511 

Locally unbiased, 340 

Local power, 433; of t-test, 465, 466

Location families (or models), 70,
100, 396; are stochastically
increasing, 70; comparing two,
219; conditional inference for, 
414; condition for monotone 
likelihood ratio, 323, 401; 
example lacking monotone 
likelihood ratio, 71; LAUMP 
tests for, 546-548; strongly 
unimodal, 401 

Location-scale families, 12; confidence
intervals based on pivot,
645; comparing two, 258;
LAUMP tests in, 557. See also 
Normality, testing for 
Log convexity, 323, 412 
Logistic distribution, 134, 323, 402 
Logistic response model, 134 
Loglikelihood ratio, 483; expansion
due to Le Cam, 489-491 
Loglinear model, 134, 318 
Lognormal family, 488 
Loss function, 3, 7; in confidence 

interval estimation, 23, 72, 76; 
in hypothesis testing, 69, 141, 
222; monotone, 76 
L_p-space, 697, 698

Main effects, 287, 292; as interactions, 
311; confidence sets for, 289; 
tests for, 287, 291

Mallows’ metric, 654
Mantel-Haenszel test, 135 
Markov chain, 145 
Markov property, 145 
Markov’s inequality, 472 
Matched pairs, 138, 183, 221, 239, 

324; comparison with complete 
randomization, 149; confidence 
intervals for, 189; rank tests 
for, 242, 246 

Maximal invariant, 214; ancillarity 
of, 395; distribution of, 218; 
method for determining, 216; 
obtained in steps, 217 
Maximin multiple tests, 354, 357,
358, 360

Maximin test, 320; by Hunt-Stein
theorem, 333; existence of,
338; local, 322; relation to
invariance, 329. See also 
Least favorable distribution; 
Minimax principle; Most 
stringent test 

Maximum likelihood, 16, 17, 504-508; 
in normal model, 504, 505; in 
exponential family models, 

505. See also Likelihood ratio 
test 

Maximum modulus confidence 
intervals, 379 
McNemar’s test, 138, 149 
Measurable: function, 30; set, 29; 

space, 29; transformation, 30, 
34 

Measure, 29 

Median, confidence bounds for, 105 
Median unbiasedness, 22; relation to 
confidence bounds, 162 
Meta-analysis, 109 
Metric space, 527, 571, 694. See 
also Hellinger; Kolmogorov- 
Smirnov; Kullback-Leibler; 
Levy; Mallows; Prohorov;
Total variation 

Minimal complete class of decision 
procedures, 17. See also 
Completeness of a class of 
distributions; Essentially 
complete class of decision 
procedures 

Minimal sufficient statistic, 21 
Minimum Chi-squared estimator, 597 
Minimax principle, 15, 347; and 
least favorable distribution, 

18; in confidence estimation, 
335; relation to invariance, 

25; relation to unbiasedness, 
24. See also Maximin test; 
Restricted Bayes solution 
Minkowski’s inequality, 697 
Missing observations, 410 
Mixed model, 297, 304, 314, 315 
Mixtures of experiments, 392, 394, 
395, 410, 414 

MLR, see Monotone likelihood ratio 
Model I and II, 297. See also Mixed 
model; Random effects model 


Model selection, 11 
Monotone class of sets, 50 
Monotone convergence theorem, 32 
Monotone decision rule, 355, 357, 387 
Monotone likelihood ratio, 65, 

69, 101, 104; mixtures of 
distributions with, 341, 

401, 403; necessary and 
sufficient condition for, 98; of 
differences, 402; of distribution 
of correlation coefficient, 261; 
of exponential family, 67; of 
location families, 323, 401,
402; of noncentral χ² and F,
307; of noncentral t, 224; of 
scale families, 324; relation to 
total positivity, 103; tests and 
confidence procedures in the 
presence of, 65, 69, 73. See also 
Stochastic increasing 
Monotone loss function, 76 
Monte Carlo simulation, 442, 443;
for bootstrap, 649; for 
subsampling, 679 
Mortality, see Hazard ordering 
Most stringent test, 276, 337; 

existence of, 346 
Moving average process, 450 
Moving blocks bootstrap, 687 
Multinomial distribution, 47, 202; as 
conditional distribution, 54; 
Dirichlet prior for, 202; for 
entries of 2 x 2 table, 128 
Multinomial model: maximum 

likelihood estimation in, 514, 
515; testing a composite 
hypothesis in, 597, 598; 
testing a simple hypothesis 
in, 514-516, 590-597; for 
2x2 table, 128, 130; for 
three-factor contingency table, 
133. See also Chi-squared test; 
Contingency tables 
Multiple comparison procedures, 
iii, 293, 343; complexity 
of, 373; history of, 391; 
interpretability of, 372; 
significance levels for, 368, 

370, 371. See also Duncan and 
Dunnett multiple comparison 
methods; Newman-Keuls 
multiple comparison 
procedure; Simultaneous 
confidence intervals;
Stepdown procedures; Stepup
procedures; Tukey levels; 
Tukey’s T-method 
Multiple decision procedures, 

5. See also Multiple 
comparisons; Multiple testing; 
Three-decision problems 
Multiple testing, iii, 293, 348; history 
of, 391; maximin procedures,
354 

Multiplicity problem, 349 
Multivariate cumulative distribution 
function, 424 

Multivariate linear hypothesis, 306, 
318. See also Linear hypothesis
Multivariate mean: nonparametric 
confidence regions based on 
bootstrap, 655, 656; multiple 
testing for, 661 

Multivariate normal distribution,
89, 304, 426; testing linear
combination of means, 90; tests
for, 345, 513, 514. See also 
Bivariate normal distribution 
Multivariate normal one-sample 

problem, the mean: confidence 
intervals for, 415; tests 
for, 305, 335, 353. See 
also Hotelling’s T²-test;
Simultaneous confidence sets 
Multivariate t-distribution, 275 

Natural parameter space of an 

exponential family, 48, 55, 119 
Negative binomial distribution, 22, 68,
144 

Neighborhood model, 326, 328 
Nested classification, 301, 313 
Nested rejection regions, 63, 96, 105 
Newman-Keuls multiple comparison 
procedure, 368, 370 
Newton’s identities, 39 
Neyman-Pearson fundamental lemma, 
60, 108; approximate version 
of, 326; generalized, 77, 108 
Neyman-Pearson statistic, 503 
Neyman’s smooth tests, 599-601;
large sample behavior, 601-607
Neyman structure, 115, 118 
Noncentral: beta distribution, 280,
307; χ²-distribution, 306,
311; F-distribution, 307;
t-distribution, 156, 161, 193,
224

Noninformative prior, 172 
Nonparametric: independence 
problem, 191, 240, 242; 
many-sample problem, 286; 
methods for linear hypotheses, 
290; one-sample problem, 118; 
test in two-way layout, 290. 
See also Permutation test; 
Rank tests; Sign test 
Nonparametric mean, 420, 459; and
the Bahadur-Savage result,
466-468; and the bootstrap,
653, 655; and Edgeworth 
expansions, 459-462; and the 
t-test, 462-466; asymptotic 
maximin and LAUMP 
property, 567-574; confidence 
intervals for based on a root, 
646, 647; resampling-based 
tests for, 672, 673. See also
Multivariate mean 
Nonparametric two-sample problem, 
130, 176, 242; confidence 
intervals in, 188, 203, 268; 
omnibus alternatives, 245; 
universally unbiased test in, 
269. See also Normal scores 
test; Wilcoxon test 
Nonparametric test, 85 
Nonparametric variance, LAUMP 
property, 574 

Normal approximation, order of error, 
663, 664 

Normal distribution N(ξ, σ²), 5, 86;
loglikelihood for, 483; testing 
against Cauchy or double 
exponential, 259; testing 
against uniform or exponential, 
260. See also Bivariate normal 
distribution; Multivariate 
normal distribution 
Normality, testing for, 260, 589. See 
also Normal distribution 
Normal many-sample problem: 

confidence sets for vector 
means, 252, 336, 366, 375, 378; 
tests for means, 285, 399. See 
also Homogeneity of means, 
tests of 

Normal one-sample problem, the 
coefficient of variation: 
confidence intervals for, 273;
test for, 157, 224, 294, 303 
Normal one-sample problem, the 
mean: admissibility of test 
for, 235; AUMP test for, 555, 
556; confidence intervals for, 
163, 250, 405; credible region 
for, 172, 174; Edgeworth
expansion for t-statistic, 517;
LAUMP test of equivalence 
with unknown variance, 563, 
564; likelihood ratio test for, 
87; median unbiased estimate 
of, 164; nonexistence of test 
with controlled power, 157; 
nonexistence of UMP test 
for, 89; optimum test for, 92, 
155, 156, 260, 283, 401; test 
for, based on random sample 
size, 95; two-stage confidence 
intervals for, of fixed length, 
198, 199; two-stage test 
for, with controlled power, 

199; two-sided test for, 260; 
sequential confidence intervals 
for, 163, 199. See also Matched 
pairs; t-test 

Normal one-sample problem, the 

variance: admissibility of test 
for, 238; conditional confidence 
intervals for, 415; confidence 
intervals for, 165, 201; credible 
region for, 174; likelihood ratio 
test for, 87; optimum test for, 
87, 92, 154, 220, 325 
Normal response model, 134 
Normal scores statistic, 269 
Normal scores test, 243; optimality 
of, 243, 244 
Normal subgroup, 257 
Normal two-sample problem, 
difference of means: 
comparison with matched 
pairs, 204; confidence intervals 
for, 165; credible region for, 
202; optimal tests for
(with variances equal), 107, 
160, 195, 225, 260, 284. See 
also Behrens-Fisher problem; 
Homogeneity of means, tests 
of; t-distribution; t-test
Normal two-sample problem, ratio of 
variances, 107, 157, 220, 238; 


confidence intervals for, 166, 
254, 272; credible region for, 
202; nonrobustness of test for, 
446; test for, 107, 157, 259. 
See also F-test for ratio of
variances; Ratio of variances 
Nuisance parameters, 318, 402 
Null set, 40 

Odds ratio, 126, 399; most accurate 
unbiased confidence intervals 
for, 200. See also Binomial 
probabilities; Contingency 
table; Two by two tables 
One parameter exponential family,
67, 81, 111; complete class for,
141; most stringent test in,
338

One-sided hypotheses, 65, 124 
One-way layout, 285, 353; Bayesian 
inference for, 304; model II 
for, 297; nonparametric, 286. 
See also Homogeneity, tests of; 
Normal many-sample problem 
Onto, see Transformation 
Optimality, 9, 10 

Orbit of transformation group, 214 
Ordered alternatives, 287 
Order notation O_P(1), o_P(1), 433;

an bn , 498; an ^ bm 535 

Order statistics, 37, 38; as 

maximal invariants, 215; 
as sufficient statistics, 53, 

176; completeness of, 118, 

141; distribution of, 266; 
equivalent to sums of powers, 
38; expected values of, 243; in 
permutation tests, 176 
Orthogonal group, 215, 217, 330 
Orthogonal: transformations, 194, 
215; vector, 697 

Orthonormal: system, 697; vector, 
697 

Paired comparisons, see Matched 
pairs 

Pairwise sufficiency, 53 
Parameter space, 3 
Parameters, unrelated, see Variation 
independent parameters 
Parametric bootstrap, 651-653; in 
Behrens-Fisher problem, 671, 
672 
Pareto distribution, 94, 196
Parseval’s identity, 697, 698 
Partial ancillarity, 398, 399 
Partial sufficiency, 106 
Pearson’s Chi-squared test, see 
Chi-squared test 
Percentile method, 685 
Permutation group, 215 
Permutation test, 130, 177,
187; approximated by
standard t-test, 180, 447;
as randomization test, 242, 
635, 641-643; complete class, 
186; computational methods 
for, 180; confidence intervals 
based on, 189, 203, 206; for 
testing independence, 192; 
history of, 210, 690; most 
powerful, 178; robustness of, 
447, 638-643; most stringent, 
346. See also Nonparametric; 
Randomization model 
Pillai-Bartlett trace test, 463; 
robustness of, 465 

Pitman asymptotic relative efficiency,
see Asymptotic relative 
efficiency 

Pivotal: method, 644-646; quantity,
253, 274 

Plug-in estimate, 648 
Point estimation, viii, 5, 7; 

equivariant, 13; history of, 27; 
unbiased, 14 

Pointwise asymptotically level a: for 
confidence sets, 423; for tests, 
422 

Pointwise consistent in power, 423 
Poisson distribution, 4, 6, 54; 
comparison of two, 125, 

398; relation to exponential 
distribution, 27, 68, 98; 
square root transformation 
for, 474; sufficient statistics 
for, 19; sums of, 54. See also 
Exponential distribution; 
Poisson parameters; Poisson 
process 

Poisson model: for 2 x 2 table, 130,
132; for 2 x 2 x K table, 133,
148 

Poisson parameters: comparing two, 
125, 398; confidence intervals 
for the ratio of two, 168; 


one-sided test for, 68, 98; 
one-sided test for sum of, 105 
Poisson process, 4, 68, 98; and 

2x2 tables, 130; confidence 
bounds for scale parameter, 

74; distribution of waiting 
times in, 22; test for scale 
parameter in, 68, 98. See also 
Exponential distribution 
Polya’s theorem, 429 
Polya frequency function, 323 
Population models, 132 
Portmanteau theorem, 425 
Positive dependence, see Dependence, 
positive 

Positive part of a function, 31 
Posterior distribution, 172; percentiles 
of, 175. See also Bayesian 
inference 

Posterior probability, 94 
Power function, 57; of invariant test, 
228; of one-sided test, 68; of 
two-sided test, 82 

Power of a test, 57, 98; conditional, 
124, 399; unbiased estimation 
of, 123 

Power series distribution, 142 
Preference ordering of decision 
procedures, 10, 14 
Prepivoting, 657, 668 
Prior distribution, 14, 172; improper, 
172; noninformative, 172. 

See also Bayesian inference; 
Least favorable distribution; 
Posterior distribution 
Probability density (with respect to 
μ), 33; convergence theorem
for, 696 

Probability distribution of a 

random variable, 30. See 
also Cumulative distribution 
function (cdf) 

Probability integral transformation, 
97, 266 

Probability measure, 39, 30 
Product measure, 34 
Prohorov’s theorem, 440 
Projection, as maximal invariant, 

216, 284 

Pseudometric space, 694 
P-value, 57, 63, 97, 98, 108; 
combination of, from 
independent experiments, 97, 
109; for randomization test,
636; for randomized tests, 64; 
in multiple testing, 350, 364; 
in stepdown procedures, 360; 
properties of, 64, 139; versus 
fixed levels, 65 

Quadrant dependence, 145, 210, 371, 
372. See also Dependence, 
positive 

Quadratic mean derivative, 484 
Quadratic mean differentiable 
(q.m.d.) families, 484; 
examples of, 486, 488;
loglikelihood expansion for, 
489; properties of, 485-487 
Quadrinomial distribution, 133 
Quality control, 85, 223 
Quantiles, 430, 649 

Rao’s score tests, see Score tests 
Radon-Nikodym derivative, 33, 51 
Radon-Nikodym theorem, 33 
Random assignment, 131, 182, 247, 
293 

Random effects model, 297; for 

nested classifications, 301, 313; 
for one-way layout, 297; for 
two-way layout, 313. See also 
Ratio of variances 
Randomization, 8, 293; as basis for 
inference, 182; possibility of 
dispensing with, 95; relation to 
permutation test, 184; tests, 
632-643. See also Random 
assignment; Randomized 
procedure 

Randomization distribution, 637 
Randomization hypothesis, 633 
Randomization models, 132, 187; 
confidence intervals in, 188; 
history of, 210 

Randomized procedure, 8; confidence 
intervals, 167; in conditioning, 
414 

Randomized test, 58; representation 
as nonrandomized test, 74 
Randomness, hypothesis of, 270 
Random sample size, 95, 142, 210 
Random variable, 30 
Rank correlation coefficient, 272 
Ranks, 216; as maximal invariants, 
216, 241; distribution under 


alternative, 265, 266; null 
distribution of, 242. See also 
Signed ranks 

Rank-sum test, 147. See also 
Wilcoxon test 

Rank tests, 241; as special case 
of permutation tests, 

635, 636; in multivariate 
problems, 318; surveys of, 

286. See also Nonparametric; 
Nonparametric two-sample 
problem; Symmetry; Trend 
Ratio of variances: confidence 

intervals for, 166, 254, 272, 
299, 558; in model II, 299; 
tests for, 157, 220, 259, 298, 
412. See also F-test for ratio of
variances; Homogeneity, tests 
of; Random effects model 
Recognizable subsets, see Relevant 
subsets 

Rectangular distribution, see Uniform 
distribution 

Regression, 169, 318, 395; as linear 
model, 278, 293; comparing 
several lines, 295, 312; 
confidence band for, 384, 

391; confidence intervals 
for coefficients, 223, 295; 
intercepts and ordinates 
of line, 170; polynomial, 

278; robustness of tests for, 
451-458; tests for coefficients, 
169, 293; with both variables 
subject to error, 312. See also 
Trend 

Regression dependence, 191, 240. See 
also Dependence, positive 
Regular (estimator sequence), 508, 
526 

Relative efficiency, 539. See also 

Asymptotic relative efficiency 
Relevant and semirelevant subsets, 
175, 405, 406, 413; history of, 
414, 415; randomized version 
of, 414; relation to Bayesian 
inference, 415 

Restricted Bayes solution, 15 
Riemann integral, 31 
Risk function, 4, 9, 10 
Robustness, 11, 347; against 

dependence, 448-451, 680; 
against U-test of means, 445, 
446, 448, 480; of efficiency, 421;
of general linear models tests,
451-458; of validity, 421; lack
of, for F-test of variances, 446;
lack of, for Chi-squared test 
of a normal variance, 445; of 
test of independence or lack of 
correlation, 476; for tests in 
two-way layout, 455; of t-test, 
444, 445. See also Adaptive 
test; Behrens-Fisher problem; 
Permutation test; Rank tests 
Root, 644 

Runs test, for testing independence 
in a Markov chain, 145, 146 

Sample, 5; haphazard, 181; stratified, 
176, 182, 188 

Sample correlation coefficient, 190, 
207; distribution of, 209; 
limiting distribution of, 438; 
monotone likelihood ratio of 
distribution, 261; variance 
stabilizing transformation for, 
438, 439. See also Bivariate 
normal distribution; Rank 
correlation coefficient 
Sample covariance matrix, 305, 316; 

distribution of, 208 
Sample distribution function, 

see Empirical cumulative 
distribution function 
Sample inspection: by attributes, 66, 
223; by variables, 85, 223; for 
comparing two products, 135, 
225 

Sample size, 8; required to achieve 
specified power, 57, 125, 199,
320 

Sample median, 429 

Sample space, 30 

Sample standard deviation, 434 

S-ancillary, 398, 399 

Scale families, 324; comparing 

two, 259, 412; conditional 
inference for, 414; condition 
for monotone likelihood ratio, 
323 

Scheffe’s S-method, 375, 380, 384, 
388; alternatives to, 384 
Score tests, 511-513; asymptotically 
maximin property, 566, 567; 
asymptotic relative efficiency
of, 536; AUMP and LAUMP
property, 545; counterexample 
to AUMP property, 547 
Score vector (or function), 489, 511 
Second-order accurate, 666 
Selection procedures, 102 
Separable: family of distributions, 
698; space, 694 

Separate families of hypotheses, 220, 
258 

Sequential procedures, 8, 9, 145, 157, 
163 

Shift, confidence intervals for: based 
on permutation tests, 203; 
based on rank tests, 251, 

268. See also Behrens-Fisher 
problem; Exponential 
distribution; Nonparametric 
two-sample problem; Normal 
two-sample problem, difference 
of means 

Shift model, 134, 250, 578, 579 
σ-field, 29; with countable generators,
699 

σ-finite, 29

Signed ranks, 242; distribution 

under alternatives, 270; null 
distribution of, 246 
Significance level, 57; for multiple 
comparisons, 368, 370; for 
stepdown procedures, 351, 361; 
nominal, 387. See also P-value 
Significance probability, see P-value 
Sign test, 85; asymptotic relative 
efficiency of, 537, 538; for 
matched pairs, 138; for testing 
consumer preferences, 135; for 
testing symmetry with respect 
to a given point, 137; history 
of, 149; in double exponential 
distribution, 342; limiting 
behavior, 501, 502; treatment 
of ties in, 167, 186. See 
also Binomial probabilities; 
Median; Sample inspection 
Similar test, 110, 115; relation to
unbiased test, 111; history of,
149

Simple: class of distributions, 59; 

hypothesis, 59 
Simple function, 31 
Simple hypothesis vs. simple 

alternative, 60, 415; with 
large samples, 503. See also
Neyman-Pearson fundamental 
lemma 

Simpson’s paradox, 132 
Simultaneous confidence intervals, 
375, 391; bootstrap, 657; for 
all contrasts, 382. See also 
Confidence bands; Dunnett’s 
multiple comparison method; 
Scheffe’s S-method; Tukey’s 
T-method 

Simultaneous confidence sets for a 

family of linear functions, 375, 
381; smallest, 378; taut, 378 
Simultaneous testing, 349. See also 
Multiple comparisons 
Single step procedure for multiple 
testing, 351 

Singly truncated normal distribution 
(STN), 144 
Skewness, 459, 662 
Slutsky’s theorem, 433 
Small-sample theory, iii 
Smirnov test, 245 
Smooth function of means, 656 
Spherically symmetric distributions, 
194, 314 

Stagewise tests, 367 
Standard confidence bounds, 77, 175 
Starshaped, 101 
Stationarity, 145 
Statistic, 30, 34; and random 
variables, 31; equivalent 
representations of, 36; fully 
informative, 96; subfield 
induced by, 34 

Statistical inference, 3; and decision 
theory, 6; history of, 27 
Stein’s two-stage procedure, 198 
Stepdown procedures, 351, 352, 391; 
canonical form for, 360; large 
sample bootstrap, 658-661 
Stepup procedures, 351, 356 
Stochastically increasing, 70, 135 
Stochastically larger, 70, 101, 240, 
354 

Stratified sampling, 176, 182, 188 
Strictly unbiased, 112 
Strongly unimodal, 323, 401, 412 
Studentization, 286, 445 
Studentized range, 367, 390 
Student’s t-distribution, see
t-distribution

Student’s t-test, see t-test
Subfield, 34 

Sufficient statistic, 19, 44, 54, 

55; Bayes definition of, 21; 
factorization criterion for, 19, 
45; for exponential families, 

47; in presence of nuisance 
parameters, 96; likelihood ratio 
as, 53; minimal, 21; pairwise, 
53; relation to ancillarity, 397; 
relation to fully informative 
statistic, 96; relation to 
invariance, 220; statistics 
independent of, 151, 152. See 
also Partial sufficiency 
Subsampling, 673-676; comparisons 
with bootstrap, 677-680; for 
hypothesis testing, 680, 681 
Superefficient estimator, 525; 
bootstrap of, 679 

Symmetric: confidence interval, 649;
distribution, 53 

Symmetry, 11, 13; and invariance, 12, 
212; sufficient statistics for 
distributions with, 53; testing 
for, 241, 246, 270; testing, with 
respect to given point, 137, 
246, 248, 270 

Tautness, 378 

t-distribution, 156, 161, 286; 

approximation to permutation 
distribution, 180; as 
distribution of function of 
sample correlation coefficient, 
207; as posterior distribution, 
174; Edgeworth expansion for, 
517; in two-stage sampling, 
198; monotone likelihood ratio 
of, 224; multivariate, 275; 
noncentral, 156, 161, 193, 224 
Test (of a hypothesis), 5, 56; 

almost invariant, 225, 241; 
conditional, 394, 400, 403; 
invariant, 214, 276; locally 
maximin, 322; locally most 
powerful, 339; maximin, 322;
most stringent, 337; of type D 
and E, 340, 341; randomized, 
58, 127; strictly unbiased, 112; 
unbiased, 110; uniformly most 
powerful (UMP), 58 
Three-decision problems, 81, 124 
Three factor contingency table, 132
Ties, 136 

Tight sequence, 439 
Time series models, 450, 451 
Total positivity, 71, 103, 115, 308, 323 
Total variation distance, 529 
Transformation: into, 30; of integrals, 
34; onto, 30; probability 
integral, 97; variance 
stabilizing, 439 

Transformation group, 12, 212, 213. 

See also Invariance; Group 
Transitive: binary relation, 569; 

transformation group, 285 
Trend: test for absence of, 271 
Triangular distribution, 259 
Trimmed mean, 647, 648 
t-test: admissibility of, 235, 237,
281; as Bayes solution, 237;
as likelihood ratio test, 25, 

87; comparison with Wilcoxon 
and sign tests, 537, 538; for 
matched pairs, 183, 204; for 
regression coefficients, 169, 

294; in linear hypothesis with 
one constraint, 281; local 
power of, 465, 466; one-sample, 
89, 156, 192, 260; optimality 
in nonparametric model, 
567-574; permutation version
of, 180, 635, 638, 639; power 
of, 156, 192, 193; relevant 
subsets for, 408; robustness 
of, 445, 446; two-sample, 161, 
176; two-stage, 199; under 
local alternatives, 501; uniform 
asymptotic behavior, 465, 466. 
See also Normal one- and 
two-sample problem 
Tukey levels for multiple comparisons, 
368, 387 

Tukey’s T-method, 367, 374, 388, 

389, 390 

Two by two by K tables, 138, 148 
Two by two tables: alternative models 
for, 128, 130, 132; comparison 
of experiments for, 130; 
Fisher’s exact test for, 127, 
149; for matched pairs, 138, 
149; McNemar’s test for, 138, 
149; multinomial model for, 
128, 130; S-ancillaries for, 399.
See also Contingency tables 


Two by two by two table, 135 

Two-sample problem, see Behrens- 
Fisher problem; Binomial 
probabilities; Exponential 
distribution; Matched pairs; 
Nonparametric two-sample 
problem; Normal two-sample 
problem; Permutation test; 
Poisson parameters; Shift, 
confidence intervals for; 
Two-by-two tables 

Two-sided alternatives, 81 

Two-way contingency tables, see 
Contingency tables; Two by 
two tables 

Two-way layout, 287, 290, 304; mixed 
models for, 314, 315; multiple 
testing in, 374; rank tests 
for, 290; reorganization of 
variables in, 311; robustness in, 
455; simultaneous confidence 
intervals in, 383; with one 
observation per cell, 287; 
with m observations per cell, 
290. See also Contingency 
tables; Interactions; Nested 
classifications; Two by two 
tables 

UMP invariant test, 150, 218, 219; 

admissibility, 232; conditional, 
404; conditions to be UMP 
almost invariant, 227; example 
of inadmissibility, 232; 
examples of nonuniqueness, 
231, 232; relation with UMP 
unbiased test, 230; trivial, 232. 
See also Invariance; Linear 
hypothesis 

Uniformly most powerful (UMP) 

test, 58, 108; conditional, 394, 
401, 403; examples involving 
two parameters, 93, 95; for 
exponential distributions, 93; 
for monotone likelihood ratio 
families, 65; for one-parameter 
exponential families; for 
uniform distribution, 92, 

99; in inverse Gaussian 
distribution, 100; in normal 
one-sample problem, 87, 88; 
in Weibull distribution, 99;
nonparametric example of, 85 
UMP unbiased test, 111; admissibility
of, 139; example of 
nonexistence of, 140; for 
multiparameter exponential 
families, 119, 121, 150; for 
one-parameter exponential 
families, 111; for strictly 
totally positive families, 115; 
relation to UMP almost 
invariant tests, 230; via 
invariance, 150, 230. See also 
Unbiasedness 

Unbiasedness, 13, 27, 110; and 
admissibility, 26; and 
invariance, 23, 229, 230; and 
minimax, 24; and similarity, 
111; for confidence intervals, 
23, 131; for point estimation, 
14, 22, 27; for two-decision 
procedures, 13; of tests, 110;
strict, 112. See also UMP 
unbiased test; Uniformly most 
accurate confidence sets 
Undetermined multipliers, 80 
Uniform confidence bands, 442 
Uniform distribution U(a,b), 9, 22; 
as distribution of probability 
integral transformation, 97; 
completeness of, 116, 141; 
discrete, 142; distribution of 
order statistics from, 267; not 
q.m.d., 488, 533; of p-values, 
64, 65; one-sample problem 
for, 92, 99, 413; relation to 
exponential distribution, 93; 
sufficient statistics for, 26; 
testing against exponential or 
triangular distribution, 260; 
other tests for, 480, 482 
Uniformly asymptotically level a: for 
confidence sets, 423, 424; for 
tests, 422 

Uniformly integrable, 472 
Uniformly most accurate confidence 
sets, 72, 73; equivariant, 249; 
minimize expected Lebesgue 
measure, 251; relation to 
UMP tests, 73; unbiased, 164. 
See also Confidence bands; 
Confidence bounds; Confidence 
intervals; Confidence sets; 
Simultaneous confidence 


intervals; Simultaneous 
confidence intervals and sets 
Unimodal, 412. See also Strongly 
unimodal

Unrelated parameters, 398 
U-statistic, 678 

Variance components, see 

Components of variance 
Variance stabilizing transformation, 
439 

Variation diminishing, 71. See also 
Total positivity 

Variation independent parameters, 
398 

Vector space, 696-698 
Vitali’s theorem, 32 

Waiting times, 22, 98 

Wald tests and confidence regions, 

508-510, 646; efficiency of, 536; 
AUMP and LAUMP property, 
548, 549 

Weak compactness theorem, 700, 701 
Weak convergence, 425, 694 
Weak conditionality principle, 400 
Weibull distribution, 99 
Weighted quadratic test statistics, 
607, 608; examples of, 611, 
612; local power calculations, 
614, 615 

Welch approximate t-test, 231, 447, 
448 

Welch-Aspin test, 231, 408
Wilcoxon one-sample test, 246 
Wilcoxon signed-rank statistic, 269, 
493, 502, 503 

Wilcoxon signed-rank test, see 
Wilcoxon one-sample test 
Wilcoxon statistic, 268, 269; 

expectation and variance of, 
265 

Wilcoxon two-sample test, 243,
245; alternative form of, 265;
comparison with t-test, 537,
538; confidence intervals 
based on, 251; history of, 276; 
optimality of, 243, 244, 267, 
268 

Wilson confidence interval for 
binomial, 435, 647 

Yule’s measure of association, 129 



